Sports: July 2009 Archives


July 12, 2009

It's reasonably common for MMA fights to be stopped due to excessive bleeding by one of the fighters. In fact, in some cases fighters will deliberately try to open up a cut on their opponent in order to get a stoppage. Apparently, some fighters are more susceptible to cuts than others. The NYT has an interesting article about plastic surgery to make them more resistant to bleeding:
So last summer, Davis, 35, contacted a plastic surgeon in Las Vegas. He wanted to make his skin less prone to cutting.

The surgeon, Dr. Frank Stile, burred down the bones around Davis's eye sockets. He also removed scar tissue around his eyes and replaced it with collagen made from the skin of cadavers.

There appear to be two claimed underlying problems: (1) sharp bone ridges in the skull, which cause cuts when strikes to the face force the skin against the bone, and (2) poor treatment of cuts in the ring, which leaves "unstable scar tissue" that is more likely to cut again in the future.

As usual with medical procedures applied to athletes, we are immediately faced with the question of whether this is simple treatment or an enhancement. To the extent to which you're fixing incompletely healed injuries, that certainly looks like medical treatment. The bone shaving, on the other hand, starts to look more like enhancement. On the other hand, I guess you could think of sharp bones the same way you would think of, say, asthma, in which case treatment starts to look appropriate. On the third hand, I think we can agree that implanting a plastic plate over your forehead, while an effective anti-cut measure, would probably be outside the rules. All this just reinforces that these distinctions are basically arbitrary; if we ban this kind of surgery, it's an advantage to people with good bone structure. Contrariwise, if we allow this kind of surgery, people who formerly had the advantage of good bone structure lose that advantage.

Of course, all this assumes that the surgery actually works. But if it doesn't, likely something that works will eventually come along.


July 5, 2009

The problem with climbing grades is that unlike running, cycling, lifting, etc. there's no objective measure of difficulty. Routes are just graded by consensus of other climbers, in this case the gym's routesetters. As a result, some routes are easier than others—and of course since different climbers have different styles, which routes are easiest depends on the climber as well—and as a practical matter some routes are really harder or easier than their rated grade.1 Of course, given that there's no objective standard, you could argue that this isn't a meaningful statement, but that's not really true: a difficulty grade is really a statement about how many people can do a route, so if you have a bunch of routes which are rated at 5.10 and I can't climb any of them but I jump on a new route rated 5.10, and race up it with no effort, that's a sign it's not really a 5.10. This is actually a source of real angst to people just starting to break into a grade—at least for me—since if I can do it, I immediately expect that the rating is soft.

It would be nice to have a more objective measurement of difficulty. While we can't do this just by measuring the route (the way we can with running, for instance) that doesn't mean the problem is insoluble; we just need to take a more sophisticated approach. Luckily, we can steal a solution from another problem domain: psychological testing. The situations are actually fairly similar: in both cases we have a trait (climbing skill, intelligence) which isn't directly measurable. Instead, we can give our subjects a bunch of problems which are generally easier the higher your level of ability. In the psychological domain, what we want to do is evaluate people's level of ability; in the climbing domain, we want to evaluate the level of difficulty of the problems. With the right methods, it turns out that these are more or less the same problem.

The technique we want is called Item Response Theory (IRT). IRT assumes that each item (question on the test or route, as the case may be) has a certain difficulty level; if you succeed on an item, that's an indication that your ability is above that level. If you fail, that's an indication that your ability is below that level. Given a set of items of known difficulties, then, we can quickly home in on someone's ability, which is how computerized adaptive tests work. Similarly, if we take a small set of people of known abilities and their performance on each item, we can use that to fit the parameters for those items.

It's typical to assume that the probability of success on each item is a logistic curve. The figure below shows an item with difficulty level 1.
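As a minimal sketch of what that logistic curve looks like in code: the function below takes an ability and a difficulty on the same scale and returns a probability of success. The names `p_success` and `discrimination` are mine, not from any particular package, and the slope parameter is an assumption (the description above only needs a difficulty, which corresponds to leaving the slope at 1).

```python
import math

def p_success(ability, difficulty, discrimination=1.0):
    """Probability of success under a logistic item response curve.

    `discrimination` controls how steep the curve is; it is an assumed
    extra parameter and defaults to 1 (difficulty-only model).
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# An item with difficulty level 1, evaluated for a few ability levels.
for ability in (-1.0, 0.0, 1.0, 2.0, 3.0):
    print(f"ability {ability:+.1f}: P(success) = {p_success(ability, 1.0):.3f}")
```

The curve passes through 0.5 exactly where ability equals difficulty, which is what it means for an item to "have" a difficulty level on this scale.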

Of course, this assumes that we already know how difficult the items are, but initially we don't know anything: we just have a set of people and items without any information about how good/difficult any of them are. In order to do the initial calibration we start by collecting a large, random sample of people and have them try each item. You end up with a big matrix of each person and whether they succeeded or failed at each one, but since you don't know how good anyone is other than by the results of this test, things get a little complicated. The basic idea behind at least one procedure, due to Birnbaum (it's not entirely clear to me if this is how modern software works; the R ltm documentation is a little opaque), is to use an iterative technique where you assign an initial set of abilities to each person and then use that to estimate the difficulty of each problem. Given those difficulty estimates, you can re-fit to determine people's abilities. You then use those ability estimates to reestimate the problem difficulties and iterate back and forth until the estimates converge, at which point you have estimates of both the difficulty of each item and the ability of each individual. (My description here is based on Baker.)
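Here is a rough sketch of that back-and-forth idea, not a reproduction of Birnbaum's procedure or of what ltm does internally. It assumes a difficulty-only logistic model, starts everyone at ability 0, takes one Newton step per person and per item in each round, clamps estimates to a fixed range (people who succeed or fail on everything would otherwise run off to infinity), and pins the scale by forcing the mean difficulty to 0. All function names and the toy data are made up for illustration.

```python
import math
import random

def p_success(ability, difficulty):
    """Logistic probability that a person succeeds on an item."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def clamp(value, bound=4.0):
    return max(-bound, min(bound, value))

def refit_difficulties(responses, abilities, difficulties):
    """One Newton step per item, holding the abilities fixed."""
    updated = []
    for j, b in enumerate(difficulties):
        probs = [p_success(theta, b) for theta in abilities]
        surplus = sum(row[j] - p for row, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        # More successes than predicted means the item is easier than b.
        updated.append(clamp(b - surplus / information))
    mean = sum(updated) / len(updated)
    return [b - mean for b in updated]  # pin the scale: mean difficulty 0

def refit_abilities(responses, abilities, difficulties):
    """One Newton step per person, holding the difficulties fixed."""
    updated = []
    for row, theta in zip(responses, abilities):
        probs = [p_success(theta, b) for b in difficulties]
        surplus = sum(x - p for x, p in zip(row, probs))
        information = sum(p * (1.0 - p) for p in probs)
        # More successes than predicted means the person is better than theta.
        updated.append(clamp(theta + surplus / information))
    return updated

def calibrate(responses, max_rounds=200, tolerance=1e-4):
    """Alternate the two fits until the estimates stop moving."""
    abilities = [0.0] * len(responses)
    difficulties = [0.0] * len(responses[0])
    for _ in range(max_rounds):
        new_b = refit_difficulties(responses, abilities, difficulties)
        new_theta = refit_abilities(responses, abilities, new_b)
        change = max(abs(new - old) for new, old in
                     zip(new_b + new_theta, difficulties + abilities))
        abilities, difficulties = new_theta, new_b
        if change < tolerance:
            break
    return abilities, difficulties

# Toy data: 100 simulated people, 20 items with evenly spread difficulties.
random.seed(0)
true_difficulties = [-2.0 + 4.0 * j / 19 for j in range(20)]
true_abilities = [random.gauss(0.0, 1.0) for _ in range(100)]
responses = [[1 if random.random() < p_success(theta, b) else 0
              for b in true_difficulties] for theta in true_abilities]

abilities, difficulties = calibrate(responses)
for true_b, estimated_b in zip(true_difficulties, difficulties):
    print(f"true {true_b:+.2f}  estimated {estimated_b:+.2f}")
```

Estimating abilities and difficulties jointly like this is known to stretch the scale somewhat, so treat the numbers as a demonstration of the iteration rather than as production-quality estimates.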

As an example I generated some toy data with 20 items and 100 subjects with a variety of abilities and fit it using R's ltm package. The figure below shows the results with the response curves for each item. As you can see, having a range of items with different difficulties lets us evaluate people along a wide range of abilities:

Once you've done this rather expensive calibration stage, however, you can easily calculate someone's abilities just by plugging in their performance on a small set of items. Actually, you can do better than that: you can perform an adaptive test where you start with an initial set of items and then use the response on those items to determine which items to use next, but even if you don't do this, you can get results fairly quickly.
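To make the "plugging in" step concrete, here is a small sketch that estimates one person's ability from their results on items whose difficulties are already calibrated, plus a very simple adaptive rule for choosing the next item. It assumes the difficulty-only logistic model; the function names and the example results are hypothetical. The "pick the item nearest the current estimate" rule is one simple choice, motivated by the fact that under this model an item is most informative when its difficulty equals the person's ability.

```python
import math

def p_success(ability, difficulty):
    """Logistic probability that a person succeeds on an item."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(results, bound=4.0, steps=25):
    """Maximum-likelihood ability from (difficulty, succeeded) pairs.

    Uses Newton's method starting from 0 and clamps the estimate to
    [-bound, bound], since all-success or all-failure records have no
    finite maximum.
    """
    theta = 0.0
    for _ in range(steps):
        probs = [p_success(theta, b) for b, _ in results]
        surplus = sum(int(ok) - p for (_, ok), p in zip(results, probs))
        information = sum(p * (1.0 - p) for p in probs)
        theta = max(-bound, min(bound, theta + surplus / information))
    return theta

def next_item(theta, remaining_difficulties):
    """Adaptive step: choose the unused item closest to the current estimate."""
    return min(remaining_difficulties, key=lambda b: abs(b - theta))

# Hypothetical record: succeeded on the two easier items, failed the two harder.
climber_results = [(-1.0, True), (0.0, True), (1.0, False), (2.0, False)]
theta = estimate_ability(climber_results)
print(f"estimated ability: {theta:.2f}")
print("next item to try:", next_item(theta, [-2.0, 0.4, 3.0]))
```

In this example the results are symmetric around 0.5, so that is where the estimate lands, and the adaptive rule then picks the remaining item at difficulty 0.4.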

That's nice if you're administering the SATs, but remember that what we wanted was to solve the opposite problem: rating the items, not the subjects. However, as I said earlier, these are the same problem. Once we have a set of subjects with known abilities, we can use that to roughly calibrate the difficulty of any new set of items/routes. So, the idea is that we create some set of benchmark routes and then we send our raters out to climb those routes. At that point we know their ability level and can use that to rate any new set of climbs.
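The "same problem" claim shows up directly in code: rating a new route from raters of known ability is the ability estimate with the roles swapped and the sign of the update flipped. This is only a sketch under the difficulty-only logistic model; `estimate_difficulty` and the rater data are hypothetical.

```python
import math

def p_success(ability, difficulty):
    """Logistic probability that a person succeeds on an item."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_difficulty(attempts, bound=4.0, steps=25):
    """Maximum-likelihood route difficulty from (rater_ability, succeeded) pairs."""
    b = 0.0
    for _ in range(steps):
        probs = [p_success(theta, b) for theta, _ in attempts]
        surplus = sum(int(ok) - p for (_, ok), p in zip(attempts, probs))
        information = sum(p * (1.0 - p) for p in probs)
        # Unexpected successes push the difficulty down, not up.
        b = max(-bound, min(bound, b - surplus / information))
    return b

# Hypothetical raters, calibrated on benchmark routes, trying one new route:
# the two weaker raters fall off, the two stronger ones send it.
attempts = [(-1.0, False), (0.0, False), (1.0, True), (2.0, True)]
print(f"estimated difficulty: {estimate_difficulty(attempts):.2f}")
```

With only a handful of raters the estimate is rough, which matches the "roughly calibrate" wording above; more raters spread around the route's difficulty tighten it.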

There's still one problem to solve: the difficulty ratings we get out of our calculations are just numbers along some arbitrary range (it's conventional to aim for a range of about -3 to +3 with the average around 0), but we want to have ratings in the Yosemite Decimal System (5.1-5.15a as of now). It's of course easy to rescale the difficulty parameter to match any arbitrary scale of our choice, but that's not really enough, because the current ratings are so imprecise. We'll almost certainly find that there are two problems A and B where A is currently rated harder than B but our calibrated scale has B harder than A. We can of course choose a mapping that minimizes these errors, but because so many routes are misrated it's probably better to start with a smaller set of benchmark routes where there is a lot of consensus on their difficulty, make sure they map correctly, and then readjust the ratings of the rest of the routes accordingly.
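One way to sketch the benchmark-based mapping: fit a straight line from calibrated difficulty to a position in an ordered list of YDS grades using only the benchmark routes, then push every other route through that line. Everything specific here is an assumption for illustration: the grade list is truncated, the grades are treated as evenly spaced, the line is fit by ordinary least squares, and the benchmark numbers are invented.

```python
# Ordered (truncated) list of YDS grades; index in the list is the numeric grade.
GRADES = ["5.5", "5.6", "5.7", "5.8", "5.9"] + [
    f"5.{number}{letter}" for number in (10, 11, 12) for letter in "abcd"
]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def to_grade(difficulty, slope, intercept):
    """Map a calibrated difficulty to the nearest grade in GRADES."""
    index = round(slope * difficulty + intercept)
    index = max(0, min(len(GRADES) - 1, index))  # stay inside the list
    return GRADES[index]

# Hypothetical benchmark routes: (calibrated difficulty, consensus grade).
benchmarks = [(-2.0, "5.7"), (0.0, "5.9"), (2.0, "5.10b")]
slope, intercept = fit_line(
    [difficulty for difficulty, _ in benchmarks],
    [GRADES.index(grade) for _, grade in benchmarks],
)

# Re-rate some other routes from their calibrated difficulties.
for difficulty in (-1.1, 1.0, 2.9):
    print(f"difficulty {difficulty:+.1f} -> {to_grade(difficulty, slope, intercept)}")
```

Fitting on the benchmarks alone is the point: routes whose current grades disagree with the calibrated ordering never get a vote, they just get re-rated.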

Note that this doesn't account for the fact that problems can be difficult in different ways; one problem might require a lot of strength and another a lot of balance. To some extent, this is dealt with by having a smooth success curve, which doesn't require that every 5.10 climber be able to climb every 5.10 route. However, ultimately if you have a single scalar ability/difficulty metric, there's only so much you can do in this regard. IRT can handle multiple underlying abilities, but the YDS scale we're trying to emulate can't, so there's not too much we can do along those lines.

Obviously, this is all somewhat speculative—it's a lot of work and I don't get the impression that route setters worry too much about the accuracy of their ratings. On the other hand, at least in climbing gyms if you were able to integrate it into a system that let people keep track of their success in their climbs (I do this already but most people find it to be too much trouble), you might be able to get the information you needed to calibrate new climbers and through them get a better sense of the ratings for new climbs.

Acknowledgement: This post benefitted from discussions with Leslie Rescorla, who initially suggested the IRT direction.

1. This seems to be especially bad for very easy and very hard routes. I think the issue with very easy routes is that routesetters are generally good climbers and so find all the routes super-easy. I'm not sure about harder problems, but it may be that they're near the limit of routesetters' abilities and so heavily dependent on whether the route matches their style.