The problem with climbing grades is that, unlike running,
cycling, lifting, etc., there's no objective measure of
difficulty. Routes are just graded by consensus of other
climbers, in this case the gym's routesetters. As
a result, some routes are easier than others; since
different climbers have different styles, which routes
feel easiest depends on the climber as well; and
as a practical matter some routes really are harder or easier
than their rated grade.

^{1} Of course, given that there's no
objective standard, you could argue that this isn't a
meaningful statement, but that's not really true:
a difficulty grade is really a statement about how many
people can do a route. So if there is a bunch of routes
rated 5.10 and I can't climb any of them, but I jump on a
new route rated 5.10 and race up it with no effort, that's a sign
it's not really a 5.10. This is actually a source of real
angst to people just starting to break into a grade (at
least it is for me), since
if I can do a route, I immediately suspect that the rating
is soft.

It would be nice to have a more objective measurement of
difficulty. While we can't do this just by measuring
the route (the way we can with running, for instance)
that doesn't mean the problem is insoluble; we just need
to take a more sophisticated approach.
Luckily, we can steal a solution from another problem domain:
psychological testing. The situations are actually
fairly similar: in both cases we have a trait (climbing
skill, intelligence) which isn't directly measurable. Instead, we can
give our subjects a bunch of problems which are generally easier
to solve the higher the subject's level of ability. In the psychological domain, what we
want to do is evaluate people's level of ability; in the
climbing domain, we want to evaluate the level of difficulty
of the problems. With the right methods, it turns out that
these are more or less the same problem.

The technique we want is called Item Response Theory (IRT). IRT assumes that
each item (question on the test or route, as the case may be)
has a certain difficulty level; if you succeed on an item,
that's an indication that your ability is above that level. If you
fail, that's an indication that your ability is below that
level. Given a set of items of known difficulties, then,
we can quickly home in on someone's ability, which is how
computerized adaptive tests work. Similarly, if we take
a small set of people of known abilities and their performance
on each item, we can use that to fit the parameters for
those items.

It's typical to assume that the probability of success on each
item is a logistic curve. The figure below shows an item
with difficulty level 1.
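To make the curve concrete, here's a minimal Python sketch of the logistic item response function (with discrimination fixed at 1 this is the one-parameter, or Rasch, model; the function name is mine, not from any IRT library):

```python
import math

def p_success(ability, difficulty, discrimination=1.0):
    """Logistic item response curve: the probability that a person of
    the given ability succeeds on an item of the given difficulty.
    With discrimination fixed at 1 this is the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# For an item of difficulty 1, a person of ability 1 succeeds half the
# time; stronger people more often, weaker people less often.
print(round(p_success(1.0, 1.0), 2))   # → 0.5
print(round(p_success(3.0, 1.0), 2))   # → 0.88
print(round(p_success(-1.0, 1.0), 2))  # → 0.12
```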

Of course, this assumes that we already know how difficult
the items are, but initially we don't know anything: we just
have a set of people and items without any information
about how good/difficult any of them are.
In order to do the initial calibration we start by collecting a
large, random sample of people and have them try each item. You end
up with a big matrix of each person and whether they succeeded or
failed at each one, but since you don't know how good anyone is other
than by the results of this test, things get a little complicated. The
basic idea behind at least one procedure, due to Birnbaum (it's not
entirely clear to me whether this is how modern software works; the R
ltm documentation is a little opaque), is to use an iterative
technique: you assign an initial set of abilities to each person and
use those to estimate the difficulty of each item. Given those
difficulty estimates, you re-fit to determine people's abilities, then
use the new ability estimates to re-estimate the item difficulties,
and iterate back and forth until the estimates converge. At that point
you have estimates of *both* the difficulty of each item and the
ability of each individual.
(My description here is based on Baker.)
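To make the back-and-forth concrete, here's a toy Python sketch of alternating estimation for the simplest (Rasch/1PL) model, fit by plain gradient ascent on the log-likelihood. This is just an illustration of the scheme, not a reimplementation of what ltm or Birnbaum's actual procedure does; the simulated data and all names are made up.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_rasch(responses, n_iters=300, lr=0.25):
    """Alternating estimation for the Rasch (1PL) model.
    responses[i][j] is 1 if person i succeeded on item j, else 0.
    Each pass updates item difficulties with abilities held fixed,
    then abilities with difficulties held fixed."""
    n_people, n_items = len(responses), len(responses[0])
    abilities = [0.0] * n_people
    difficulties = [0.0] * n_items
    for _ in range(n_iters):
        # Items: ascend the log-likelihood in each difficulty b_j.
        # Items with many successes drift down (easier), few drift up.
        for j in range(n_items):
            grad = sum(responses[i][j] - sigmoid(abilities[i] - difficulties[j])
                       for i in range(n_people))
            difficulties[j] -= lr * grad / n_people
        # People: ascend the log-likelihood in each ability theta_i.
        for i in range(n_people):
            grad = sum(responses[i][j] - sigmoid(abilities[i] - difficulties[j])
                       for j in range(n_items))
            abilities[i] += lr * grad / n_items
        # The scale is only identified up to a shift; anchor mean ability at 0.
        mean = sum(abilities) / n_people
        abilities = [a - mean for a in abilities]
    return abilities, difficulties

# Simulate 100 people and 20 items, then try to recover the difficulties.
random.seed(0)
true_theta = [random.gauss(0, 1) for _ in range(100)]
true_b = [-2 + 4 * j / 19 for j in range(20)]  # evenly spaced, -2 to +2
data = [[int(random.random() < sigmoid(t - b)) for b in true_b]
        for t in true_theta]
est_theta, est_b = fit_rasch(data)
print(est_b[0], est_b[-1])  # hardest item should come out well above easiest
```

With this setup the recovered difficulties come out in roughly the right order and spread, which is all the toy is meant to show.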

As an example I generated some toy data with 20 items and 100 subjects
with a variety of abilities and fit it using R's
ltm
package. The figure below shows the results with the response
curves for each item. As you can see, having a range of items with
different difficulties lets us evaluate people along a wide range
of abilities:

Once you've done this rather expensive calibration stage, however,
you can easily calculate someone's abilities just by plugging in
their performance on a small set of items. Actually, you can
do better than that: you can perform an adaptive test where
you start with an initial set of items and then use the
response on those items to determine which items to
use next, but even if you don't do this, you can get results
fairly quickly.
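Given calibrated item difficulties, scoring a new person reduces to a one-dimensional maximum-likelihood search. A minimal sketch under the Rasch model, using a coarse grid rather than the Newton-style updates real adaptive-testing software uses:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def estimate_ability(results, difficulties):
    """Maximum-likelihood ability estimate over a coarse grid, given a
    small set of items whose difficulties are already calibrated
    (Rasch model: P(success) = sigmoid(ability - difficulty))."""
    def loglik(theta):
        ll = 0.0
        for x, b in zip(results, difficulties):
            p = sigmoid(theta - b)
            ll += math.log(p if x else 1.0 - p)
        return ll
    grid = [x / 10 for x in range(-40, 41)]  # abilities from -4.0 to +4.0
    return max(grid, key=loglik)

# A climber who sends the three easier routes but falls on the two
# harder ones lands a bit above the middle of the scale.
print(estimate_ability([1, 1, 1, 0, 0], [-2.0, -1.0, 0.0, 1.0, 2.0]))  # → 0.6
```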

That's nice if you're administering the SATs, but remember
that what we wanted was to solve the opposite problem: rating
the *items*, not the subjects. However, as I said earlier,
these are the same problem. Once we have a set of subjects
with known abilities, we can use that to roughly calibrate the
difficulty of any new set of items/routes. So, the idea
is that we create some set of benchmark routes and then
we send our raters out to climb those routes. At that
point we know their ability level and can use that to
rate any new set of climbs.
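Rating the new climb is the same computation with the roles swapped: hold the raters' abilities fixed and search for the difficulty that best explains who sent the route and who fell. A sketch under the same Rasch assumptions, with invented rater abilities:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def estimate_difficulty(attempts, abilities):
    """Grid-search the route difficulty that maximizes the likelihood
    of the observed attempts, given raters of known (calibrated)
    ability.  attempts[i] is 1 if rater i succeeded."""
    def loglik(b):
        ll = 0.0
        for x, theta in zip(attempts, abilities):
            p = sigmoid(theta - b)
            ll += math.log(p if x else 1.0 - p)
        return ll
    grid = [x / 10 for x in range(-40, 41)]  # difficulties from -4.0 to +4.0
    return max(grid, key=loglik)

# Two weaker raters fall, two stronger raters send:
# the route's difficulty lands between them.
print(estimate_difficulty([0, 0, 1, 1], [-1.0, 0.0, 1.0, 2.0]))  # → 0.5
```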

There's still one problem to solve: the difficulty ratings we
get out of our calculations are just numbers along some
arbitrary range (it's conventional to aim for a range
of about -3 to +3 with the average around 0), but we want
to have ratings in the Yosemite Decimal System (5.1-5.15a as
of now). It's of course easy to rescale the difficulty
parameter to match any arbitrary scale of our choice, but
that's not really enough, because the current ratings are
so imprecise. We'll almost certainly find that there
are two problems A and B where A is currently
rated harder than B but our calibrated scale has B harder
than A. We can of course choose a mapping that minimizes
these errors, but because so many routes are misrated it's probably better to start with a
smaller set of benchmark routes where there is a lot of
consensus on their difficulty, make sure they map correctly,
and then readjust the ratings of the rest of the routes
accordingly.
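One simple version of that benchmark-based mapping: keep a short table of consensus routes with their calibrated difficulties, and give each new route the grade of the nearest benchmark. The benchmark difficulties and grades below are invented for illustration; a monotone regression over many benchmarks would be a more principled choice.

```python
def grade_from_difficulty(d, benchmarks):
    """Map a calibrated difficulty onto a YDS grade by snapping to the
    nearest benchmark route.  benchmarks is a list of
    (calibrated_difficulty, yds_grade) pairs; values here are made up."""
    return min(benchmarks, key=lambda pair: abs(pair[0] - d))[1]

benchmarks = [(-2.0, "5.8"), (-1.0, "5.9"), (0.0, "5.10a"),
              (1.0, "5.10c"), (2.0, "5.11a")]
print(grade_from_difficulty(0.4, benchmarks))   # → 5.10a
print(grade_from_difficulty(1.7, benchmarks))   # → 5.11a
```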

Note that this doesn't account for the fact that
problems can be difficult in different ways; one
problem might require a lot of strength and one
require a lot of balance. To some extent, this is
dealt with by having a smooth success curve
which doesn't require that every 5.10 climber be
able to climb every 5.10 route. However, ultimately
if you have a single scalar ability/difficulty
metric, there's only so much you can do in this
regard. IRT can handle multiple underlying abilities, but
the YDS scale we're trying to emulate can't, so
there's not too much we can do along those lines.

Obviously, this is all somewhat speculative—it's
a lot of work and I don't get
the impression that route setters worry too much about the
accuracy of their ratings. On the other hand, at least
in climbing gyms, if you were
able to integrate this into a system that let people keep
track of their success on their climbs (I do this already,
but most people find it to be too much trouble), you
might be able to get the information you need to
calibrate new climbers and, through them, get a better
sense of the ratings for new climbs.

**Acknowledgement:** This post benefitted from discussions with
Leslie Rescorla,
who initially suggested the IRT direction.

^{2.} This seems to be especially bad for very
easy and very hard routes. I think the issue with very easy
routes is that routesetters are generally good climbers and so
find all the routes super-easy. I'm not sure about harder
problems, but it may be that they're near the limit of the
routesetters' abilities and so heavily dependent on whether
the route matches their style.