^{1}Of course, given that there's no objective standard, you could argue that this isn't a meaningful statement, but that's not really true: a difficulty grade is really a statement about how many people can do a route. If there's a bunch of routes rated 5.10 and I can't climb any of them, but then I jump on a new route rated 5.10 and race up it with no effort, that's a sign it's not really a 5.10. This is actually a source of real angst to people just starting to break into a grade—at least for me—since if I can do it, I immediately expect that the rating is soft.

It would be nice to have a more objective measurement of difficulty. While we can't do this just by measuring the route (the way we can with running, for instance) that doesn't mean the problem is insoluble; we just need to take a more sophisticated approach. Luckily, we can steal a solution from another problem domain: psychological testing. The situations are actually fairly similar: in both cases we have a trait (climbing skill, intelligence) which isn't directly measurable. Instead, we can give our subjects a bunch of problems which are generally easier the higher your level of ability. In the psychological domain, what we want to do is evaluate people's level of ability; in the climbing domain, we want to evaluate the level of difficulty of the problems. With the right methods, it turns out that these are more or less the same problem.

The technique we want is called Item Response Theory (IRT). IRT assumes that each item (question on the test or route, as the case may be) has a certain difficulty level; if you succeed on an item, that's an indication that your ability is above that level. If you fail, that's an indication that your ability is below that level. Given a set of items of known difficulties, then, we can quickly home in on someone's ability, which is how computerized adaptive tests work. Similarly, if we take a small set of people of known abilities and their performance on each item, we can use that to fit the parameters for those items.

It's typical to assume that the probability of success on each item is a logistic curve. The figure below shows an item with difficulty level 1.
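To make that concrete, here's a minimal Python sketch of the logistic success model (the post's actual fitting was done with R's ltm; the function and parameter names here are just mine). In the two-parameter version, `discrimination` controls how sharply the curve rises; fixing it at 1 gives the simpler Rasch (one-parameter) model.

```python
import math

def p_success(ability, difficulty, discrimination=1.0):
    """Probability of success on an item under a logistic (2PL) model.

    Fixing discrimination at 1 gives the Rasch (1PL) model. The
    parameter names are mine, not from any particular package."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# An item with difficulty 1: someone whose ability exactly matches the
# difficulty succeeds half the time; people well above or below the
# item's level succeed nearly always or nearly never.
print(p_success(1.0, 1.0))   # 0.5
print(p_success(3.0, 1.0))   # ~0.88
print(p_success(-1.0, 1.0))  # ~0.12
```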

Of course, this assumes that we already know how difficult
the items are, but initially we don't know anything: we just
have a set of people and items without any information
about how good/difficult any of them are.
In order to do the initial calibration we start by collecting a
large, random sample of people and have them try each item. You end
up with a big matrix of each person and whether they succeeded or
failed at each one, but since you don't know how good anyone is other
than by the results of this test, things get a little complicated. The
basic idea behind at least one procedure, due to Birnbaum
(it's not entirely clear to
me whether this is how modern software works; the R ltm documentation is a
little opaque), is to use an iterative technique where you assign
an initial set of abilities to each person and then use that to
estimate the difficulty of each problem. Given those assignments,
we can re-fit to determine people's abilities.
You then use those estimates to
reestimate the problem difficulties and iterate back and forth until
the estimates converge, at which point you have estimates of
*both* the difficulty of each item and the ability of each
individual.
(My description here is based on Baker).
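Here's a toy Python sketch of that alternating scheme under the Rasch model. To be clear, this is my reconstruction of the back-and-forth idea via simple gradient ascent on the likelihood, not what ltm or any production IRT package actually does internally:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_joint(responses, iters=50, lr=0.1):
    """Alternately re-estimate abilities and difficulties under a Rasch
    model by gradient ascent on the joint log-likelihood.

    responses[i][j] is 1 if subject i succeeded on item j, else 0.
    A toy sketch of the alternating idea, not a production fitter."""
    n, m = len(responses), len(responses[0])
    ability = [0.0] * n
    difficulty = [0.0] * m
    for _ in range(iters):
        # Step 1: holding abilities fixed, nudge each item's difficulty
        # uphill on the likelihood (gradient is sum of p - observed).
        for j in range(m):
            grad = sum(logistic(ability[i] - difficulty[j]) - responses[i][j]
                       for i in range(n))
            difficulty[j] += lr * grad / n
        # Step 2: holding difficulties fixed, nudge each ability
        # (gradient is sum of observed - p).
        for i in range(n):
            grad = sum(responses[i][j] - logistic(ability[i] - difficulty[j])
                       for j in range(m))
            ability[i] += lr * grad / m
        # The model is only identified up to a shift, so anchor the
        # scale by centering abilities at zero.
        mean = sum(ability) / n
        ability = [a - mean for a in ability]
        difficulty = [d - mean for d in difficulty]
    return ability, difficulty
```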

As an example I generated some toy data with 20 items and 100 subjects with a variety of abilities and fit it using R's ltm package. The figure below shows the results with the response curves for each item. As you can see, having a range of items with different difficulties lets us evaluate people along a wide range of abilities:
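For anyone who doesn't want to fire up R, here's roughly how one might generate analogous toy data in Python: draw abilities and difficulties from standard normals, then fill in each cell of the response matrix with a Bernoulli draw from the logistic success probability. (The post's actual data came from a similar process and was fit with ltm; the shapes and seed here are just for illustration.)

```python
import math
import random

def simulate_responses(n_subjects=100, n_items=20, seed=0):
    """Generate a toy person-by-item response matrix under a Rasch
    model: abilities and difficulties are standard normal draws, each
    cell a Bernoulli success with logistic probability."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0, 1) for _ in range(n_subjects)]
    difficulties = [rng.gauss(0, 1) for _ in range(n_items)]
    resp = [[int(rng.random() < 1.0 / (1.0 + math.exp(-(a - d))))
             for d in difficulties]
            for a in abilities]
    return resp, abilities, difficulties

resp, abilities, difficulties = simulate_responses()
print(len(resp), len(resp[0]))  # 100 20
```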

Once you've done this rather expensive calibration stage, however, you can easily calculate someone's abilities just by plugging in their performance on a small set of items. Actually, you can do better than that: you can perform an adaptive test where you start with an initial set of items and then use the response on those items to determine which items to use next, but even if you don't do this, you can get results fairly quickly.
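As a sketch of that scoring step: once the item difficulties are calibrated, estimating one person's ability is just maximizing a one-parameter likelihood. Here it's done with plain gradient ascent under the Rasch model (a real scorer would use something more careful, like Newton's method or a Bayesian update, and a real adaptive test would also pick the next item to maximize information):

```python
import math

def estimate_ability(responses, difficulties, lr=0.2, steps=200):
    """Given calibrated item difficulties, estimate one subject's
    ability by maximizing the Rasch likelihood of their responses.
    Simple gradient ascent; a sketch, not a production scorer."""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: observed minus predicted.
        grad = sum(y - 1.0 / (1.0 + math.exp(-(theta - d)))
                   for y, d in zip(responses, difficulties))
        theta += lr * grad / len(responses)
    return theta

# A climber who sends the easier routes but falls on the harder ones
# should come out somewhere in the middle of the scale.
diffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(estimate_ability([1, 1, 1, 0, 0], diffs))
```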

That's nice if you're administering the SATs, but remember
that what we wanted was to solve the opposite problem: rating
the *items*, not the subjects. However, as I said earlier,
these are the same problem. Once we have a set of subjects
with known abilities, we can use that to roughly calibrate the
difficulty of any new set of items/routes. So, the idea
is that we create some set of benchmark routes and then
we send our raters out to climb those routes. At that
point we know their ability level and can use that to
rate any new set of climbs.
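In code, this really is just the mirror image of scoring a subject: hold the raters' known abilities fixed and fit the one free difficulty parameter for the new route (same caveats as before; this is my sketch of the idea, not how any particular package does it):

```python
import math

def estimate_difficulty(results, abilities, lr=0.2, steps=200):
    """Given raters of known ability and their send/fall results on a
    new route, estimate the route's difficulty by maximum likelihood
    under the Rasch model (gradient ascent on one parameter)."""
    d = 0.0
    for _ in range(steps):
        # Gradient with respect to difficulty: predicted minus observed.
        grad = sum(1.0 / (1.0 + math.exp(-(a - d))) - y
                   for y, a in zip(results, abilities))
        d += lr * grad / len(results)
    return d

# A route sent only by the two strongest raters lands between the
# abilities of those who fell and those who sent.
print(estimate_difficulty([0, 0, 1, 1], [-1.0, 0.0, 1.0, 2.0]))
```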

There's still one problem to solve: the difficulty ratings we get out of our calculations are just numbers along some arbitrary range (it's conventional to aim for a range of about -3 to +3 with the average around 0), but we want to have ratings in the Yosemite Decimal System (5.1-5.15a as of now). It's of course easy to rescale the difficulty parameter to match any arbitrary scale of our choice, but that's not really enough, because the current ratings are so imprecise. We'll almost certainly find that there are two problems A and B where A is currently rated harder than B but our calibrated scale has B harder than A. We can of course choose a mapping that minimizes these errors, but because so many routes are misrated it's probably better to start with a smaller set of benchmark routes where there is a lot of consensus on their difficulty, make sure they map correctly, and then readjust the ratings of the rest of the routes accordingly.
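As a sketch of that mapping step, the crudest possible version just assigns each new route the grade of its nearest benchmark (the anchor values and grades below are entirely hypothetical; a serious version would want to enforce monotonicity and minimize rank inversions across the whole set):

```python
def logit_to_yds(theta, benchmarks):
    """Map a fitted difficulty onto a YDS grade by snapping to the
    nearest calibrated benchmark route. `benchmarks` is a list of
    (fitted_difficulty, grade_label) pairs; the values here are
    made up for illustration."""
    return min(benchmarks, key=lambda b: abs(b[0] - theta))[1]

# Hypothetical consensus benchmarks on the fitted -3..+3 scale:
benchmarks = [(-2.0, "5.8"), (-1.0, "5.9"), (0.0, "5.10a"),
              (1.0, "5.11a"), (2.0, "5.12a")]
print(logit_to_yds(0.4, benchmarks))   # 5.10a
print(logit_to_yds(-1.8, benchmarks))  # 5.8
```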

Note that this doesn't account for the fact that problems can be difficult in different ways; one problem might require a lot of strength and one a lot of balance. To some extent, this is dealt with by having a smooth success curve, which doesn't require that every 5.10 climber be able to climb every 5.10 route. However, ultimately if you have a single scalar ability/difficulty metric, there's only so much you can do in this regard. IRT can handle multiple underlying abilities, but the YDS scale we're trying to emulate can't, so there's not too much we can do along those lines.

Obviously, this is all somewhat speculative—it's a lot of work, and I don't get the impression that route setters worry too much about the accuracy of their ratings. On the other hand, at least in climbing gyms, if you were able to integrate it into a system that let people keep track of their success on climbs (I do this already, but most people find it too much trouble), you might be able to get the information you needed to calibrate new climbers and, through them, get a better sense of the ratings for new climbs.

**Acknowledgement:** This post benefitted from discussions with
Leslie Rescorla,
who initially suggested the IRT direction.

^{1.} This seems to be especially bad for very
easy and very hard routes. I think the issue with very easy
routes is that routesetters are generally good climbers and so
find all the routes super-easy. I'm not sure about harder
problems, but it may be that they're near the limit of
routesetters' abilities and so heavily dependent on whether
the route matches their style.
