My last post generated a few comments and they seemed like they deserved a little more room than my own comments in reply (plus I made some graphs!), so here we go.
First, the issue of using percentages as a measure. We were specifically talking about rebound percentage, which basketball-reference.com calculates as the percentage of available rebounds a player gets while he's on the floor (you can also compute it separately for offensive and defensive rebounds, out of own-team misses and other-team misses, respectively). In general, I like rebound percentage. It naturally accounts for things that differ across players and teams, like pace, teammate shooting ability, and time on the court. Reggie Evans, Dwight Howard, and Zach Randolph all get 12.1 boards per game, for example, yet Randolph plays a minute more per game than Howard and 9 minutes more than Evans; on the other hand, Memphis plays much slower than Toronto or Orlando. But by looking at rebound percentage, I can see that Evans is the best rebounder of the three: he gets 26.3% of available rebounds while he's on the floor, versus 20.7% for Howard and 19.3% for Randolph. Presumably if Evans played more, or if Randolph's team played faster, they would get more rebounds. But at any pace or playing time, Evans apparently comes out on top.
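To make the adjustment concrete, here is a small sketch of rebound percentage as basketball-reference.com defines it, to the best of my knowledge (the share of available rebounds a player grabbed while on the floor). The function name and all the sample numbers are mine, made up for illustration; they are not Evans's or Howard's actual lines.

```python
def rebound_pct(player_trb, player_mp, team_mp, team_trb, opp_trb):
    """Estimate the % of available rebounds a player got while on the floor.

    team_mp is total team minutes (about 240 for a 48-minute game times
    5 players); dividing by 5 converts it back to game minutes.
    """
    available = team_trb + opp_trb  # every rebound somebody grabbed
    return 100.0 * (player_trb * (team_mp / 5)) / (player_mp * available)

# A hypothetical big man: 10 boards in 30 minutes of a game where the
# two teams combined for 75 rebounds.
pct = rebound_pct(player_trb=10, player_mp=30, team_mp=240,
                  team_trb=40, opp_trb=35)
```

Per-game rebounds would rank players by raw totals; the percentage adjusts for minutes played and for how many rebounds were actually there to get, which is exactly why it separates Evans from Howard and Randolph.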
However, rebound percentage has the disadvantage of being a percentage. That means (as a proportion) it must fall between 0 and 1. When we look at how a percentage varies with some other variable, the function is often somewhat S-shaped: it starts out nearly flat, climbs faster and faster, is roughly linear in the middle, and then flattens out again as the probability approaches 1. A number of functions can describe this shape, like the logistic or probit. As an example, here's a graph with two lines, each made using a logistic-like function. The line on the left has the same 'slope' as the line on the right but a different intercept, which changes where y=.5 hits the x axis.
Below is a graph with two more functions. Both have intercepts of 0, so y=.5 is at x=0, but much smaller slopes than the functions above. One is a more gradual increase than the two above, which I would say is more typical of functions like these, and the other has a slope so small that it looks linear in the range I plotted. It would only approach 0 and 1 at much smaller and bigger values of x, but rest assured that it would flatten out when it got there.
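Curves like the ones in both graphs can be generated from a standard logistic function. The parameter names below are mine for illustration; nothing here is fit to real rebounding data.

```python
import math

def logistic(x, slope=1.0, intercept=0.0):
    """S-shaped curve from 0 to 1; crosses y = .5 where slope*x + intercept = 0."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))

xs = range(-6, 7)

# First graph: same slope, different intercepts. The curve shifts left
# or right, moving where y = .5 lands on the x axis.
left  = [logistic(x, slope=4.0, intercept=8.0) for x in xs]
right = [logistic(x, slope=4.0, intercept=-8.0) for x in xs]

# Second graph: intercept 0 (so y = .5 at x = 0) but much smaller
# slopes. One is a gentler S; the other has a slope so small it looks
# linear over this plotting range.
gentle        = [logistic(x, slope=0.5) for x in xs]
nearly_linear = [logistic(x, slope=0.05) for x in xs]
```

Over x from -6 to 6, the `nearly_linear` curve only spans roughly .43 to .57, which is why it looks like a straight line; it would still flatten toward 0 and 1 at much more extreme x values.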
The main point I want to make here is that in every case, functions flatten out as they approach y=1, as they must since probabilities cannot be greater than 1. Their main difference is in the slope, which describes how quickly they move from low to high probability. The lines in the first graph have high slopes and so there is a quick jump from 0 to 1; the slopes in the second are smaller and so the jump is more gradual.
So now we can turn to diminishing returns. Since we are using probabilities, there will ALWAYS be diminishing returns, regardless of whether we are talking about rebounds, shots, or anything else. The only question is where on the x axis they set in; perhaps the slope should be the measure of 'how much' diminishing returns a statistic exhibits (although I'd still be wary of measuring it this way). The leveling-out towards 1 starts extremely early in the first graph, a bit later for the S-shaped line in the second, and will start very late for the nearly-straight line.
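One way to see "always diminishing returns" concretely: past the curve's midpoint, each extra unit of x buys less probability than the last. A toy check with a unit-slope logistic (my own illustrative numbers):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Marginal gain in probability from one more unit of x, evaluated
# successively further past the midpoint (x = 0). The gains shrink
# as the curve flattens toward y = 1.
gains = [logistic(x + 1) - logistic(x) for x in (0, 2, 4)]
```

The first step (from x=0 to x=1) buys about .23 of probability; by x=4 the same-sized step buys only about .01.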
This is why I said that I wouldn’t necessarily use rebound percentage to look for diminishing returns because a) I would always find it if I looked at the full function and b) the amount of diminishing returns I find would depend on what portion of the x axis my data fell in. In the data presented at countthebasket, for example, projected offensive rebound percentage is at a lower range than projected defensive rebound percentage. Eli concludes that there are greater diminishing returns for defensive rebounds, but they could in fact be described by the same equation; defensive rebounds simply are higher on the x axis and so the actual probabilities are becoming smushed while actual offensive rebound probability is in a more linear section of the curve. If we were looking at a league where steals happened all the time, we would be talking about diminishing returns in steal percentage.
The S-shape of probabilities not only guarantees diminishing returns; at the low end of the x axis it makes the interesting prediction of "accelerating returns". That is, at the low end, each additional N points of whatever x is moves your probability up more than the previous N points did, with the gains growing until you reach the middle of the curve (after which they start shrinking again). I don't have the actual data to fit, but I think you might get this for the steal and block data here.
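The mirror image of the earlier check, again with a toy unit-slope logistic: below the midpoint, each extra unit of x buys more probability than the one before, which is what "accelerating returns" means here.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Marginal gain from one more unit of x, starting deep in the low tail
# and moving toward the midpoint: each step buys more than the last.
gains = [logistic(x + 1) - logistic(x) for x in (-5, -3, -1)]
```

The step from x=-5 to x=-4 buys about .01 of probability, while the step from x=-1 to x=0 buys about .23, twenty-odd times as much.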
So if I were going to look for diminishing returns in the same way that the two linked articles above did, I would not use rebound percentage. Since the projected and actual values come from a player's season value compared to line-up-level value from play-by-play data, there's no need to account for pace and the like (unless certain line-ups within a team play drastically differently even when the same player is in both). As I have said before and commenter Guy suggested, rebounds per 36 minutes should be fine. Also, I would not add up probabilities regardless of the statistic being investigated: probabilities do not add linearly (in the second link, you actually get a projected assist rate greater than 1, which is impossible).
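A quick illustration of why summing rates goes wrong. The two rates below are invented, not taken from the linked data; the point is only that adding probabilities can escape [0, 1], while any legitimate probabilistic combination cannot.

```python
# Two hypothetical projected rates (made-up numbers).
p1, p2 = 0.60, 0.55

# Treating them as addable quantities gives "115%", which is impossible.
naive_sum = p1 + p2

# If these were independent chances of some event, the probability that
# at least one occurs is one legitimate way to combine them, and it
# stays within [0, 1]:
at_least_one = 1 - (1 - p1) * (1 - p2)
```

Whether "at least one occurs" is the right combination depends entirely on what the rates mean; the general lesson is just that probabilities have to be combined as probabilities, not stacked like counting stats.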
Regarding the weighting of variables and how they relate to summary statistics, it might not be the weight per se so much as the relative weighting. I said that my fake WP48 was more consistent because it put a higher weight (.3) on fake rebounding than fake NBA Efficiency (.25) or fake PER (.15), and fake rebounding was the most consistent measure. But it’s also true that fake WP48 has a higher relative weighting of fake rebounding to fake shooting efficiency (which is less consistent); the ratio is .3/.1 = 3 for fake WP48, 1 for fake NBA Efficiency, and .3 for fake PER. I’d have to play with more numbers to see if the absolute weight or the relative weight is important. In either case, I recommend reading Ty’s post on the Blazers for a counterargument to the importance of rebounding in WP48.
Finally, some words on consistency. In the way I was using it last post, consistent is the same thing as predictive, because I was looking at consistency across years. If something is consistent, I can be confident about what it will be like next year, which means I can make predictions. Jason Kidd rebounds well this year so he will probably rebound well next year; he is shooting poorly this year but he might shoot better (or worse) next year. Consistency is to be valued, but it shouldn’t be imposed (by using some algorithm that maximizes a correlation) and it doesn’t have to occur. For example, I have never seen research that finds that much of anything in football is consistent. Some statistics do correlate (at least within a year), but usually weakly; I would say that this means they are only somewhat consistent. Less consistent measures lead to noisier predictions, which is why picking 55% against the spread is amazing. All the statistics, as well as our eyes, tell us that the Patriots are one of the best teams this year, yet because of small sample sizes and inconsistency it is hard to be confident about how well they will play next game (or next year).
On the other hand, some things do appear to be consistent. Year-to-year rebounding, blocked shots, and assists (per minute) at the player level all appear to be pretty stable (correlations of .87, .87, and .9 according to Stumbling on Wins). There isn’t really a way to know why this is; I could speculate that it has to do with a combination of a player’s ability and having a stable role on a team even if he switches teams (big men will continue to be asked to rebound and defend the basket; point guards will continue to move the ball). Similarly I can only speculate as to why shooting percentage is less consistent, and I have absolutely no idea why free throw percentage only has a .59 correlation across seasons. All we can do is measure these things and then use them the best we can. My claim in the last post was simply that if you combine a number of measures, it appears that your summary measure will be consistent to the extent that it heavily weights (or relatively weights) consistent measures.
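That last claim can be checked with a small simulation. All the numbers below are invented: .87 mimics the rebounding correlation reported in Stumbling on Wins, and .40 is a stand-in for a less consistent shooting-type measure. The weights echo the fake-metric weights from the last post but are otherwise arbitrary.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def two_seasons(r, n, rng):
    """Simulate (year1, year2) player values with year-to-year correlation r."""
    y1 = [rng.gauss(0, 1) for _ in range(n)]
    y2 = [r * a + math.sqrt(1 - r * r) * rng.gauss(0, 1) for a in y1]
    return y1, y2

rng = random.Random(0)
n = 5000
reb1, reb2 = two_seasons(0.87, n, rng)      # consistent, like rebounding
shot1, shot2 = two_seasons(0.40, n, rng)    # less consistent, like shooting

def composite(w_reb, w_shot, reb, shot):
    return [w_reb * a + w_shot * b for a, b in zip(reb, shot)]

# Year-to-year correlation of two composites that differ only in which
# ingredient they weight heavily.
heavy_reb = pearson(composite(.3, .1, reb1, shot1),
                    composite(.3, .1, reb2, shot2))
heavy_shot = pearson(composite(.1, .3, reb1, shot1),
                     composite(.1, .3, reb2, shot2))
```

In this setup the composite that leans on the consistent measure repeats itself across seasons far better than the one that leans on the noisy measure, which is exactly the pattern the fake-metric exercise showed.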