An APM Primer/Bleg

Apologies for how quiet it’s been around here; it turns out that even grad students have to do stuff sometimes.  Although it seems to have paid off, because I have a postdoctoral position lined up for the fall, which is great.  In the meantime, I’ll try to be better; it shouldn’t be too hard as the playoffs (both NBA and NHL) get closer.

For now, I wanted to look at adjusted plus/minus (APM) and see if I could provide a bit of a summary and ask a few questions.  APM is an alternative to boxscore measures of player productivity.  Instead of looking at how many points, rebounds, steals, etc., a player accumulates and then weighting them in some manner to get a single number that measures his production (the method used for Wins Produced, Win Shares, PER, etc.), APM uses regression to directly estimate how a team does when he is on the floor.  The ‘pure’ APM format is to create a data set where each line contains a snippet of a game, covering one timespan in which no players are subbed in or out.  The dependent variable is the point differential for the home team in that timeframe (typically adjusted by the number of possessions to get a points per possession measure).  The independent variables are dummy codes, one column per player: 1 if he is on the floor for the home team in that timeframe, -1 if he is on the floor for the away team, and 0 if he is not playing.  So at the start of a game between the away team Knicks and home team Celtics, you would have 1’s in each column for Kevin Garnett, Rajon Rondo, and the other Celtic starters; -1 in each column for Amar’e Stoudemire, Carmelo Anthony, and the other Knicks starters; and the points per possession value would be however much those five Celtic players outscore (or are outscored by) those five Knicks players divided by the number of possessions.  Once a player is substituted in on either team, a new data line is started.  And your data set has every possession from every game played all season long.  APM has been discussed and described in a few places; you can look at posts by Arturo, Aaron at basketballvalue.com, Dan Rosenbaum, Eli Witus, a number of places in the APBR community, Dave Berri, and I’m sure elsewhere.  Obviously a lot of credit goes to them; I’m summarizing their work and thoughts here.
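To make the data layout concrete, here is a minimal Python sketch of the setup described above. It is only an illustration under stated assumptions: the stint numbers are invented, lineups are cut to two players per side, the `build_design` helper and the bench placeholders are hypothetical, and it leaves out refinements that published APM models may use (such as weighting stints by possessions or adding a home-court term).

```python
import numpy as np

# Toy stints with invented numbers: (home players, away players,
# home point differential, possessions). Lineups are cut to two
# players per side to keep the example short.
stints = [
    (["Garnett", "Rondo"], ["Stoudemire", "Anthony"], 4, 10),
    (["Garnett", "Celtics_bench"], ["Stoudemire", "Anthony"], -2, 8),
    (["Rondo", "Celtics_bench"], ["Anthony", "Knicks_bench"], 1, 5),
    (["Garnett", "Rondo"], ["Stoudemire", "Knicks_bench"], 3, 12),
]

def build_design(stints):
    """Return (players, X, y): one row per stint, one column per player."""
    players = sorted({p for home, away, _, _ in stints for p in home + away})
    col = {p: j for j, p in enumerate(players)}
    X = np.zeros((len(stints), len(players)))
    y = np.zeros(len(stints))
    for i, (home, away, diff, poss) in enumerate(stints):
        for p in home:
            X[i, col[p]] = 1.0    # on the floor for the home team
        for p in away:
            X[i, col[p]] = -1.0   # on the floor for the away team
        y[i] = diff / poss        # home point differential per possession
    return players, X, y

players, X, y = build_design(stints)

# Ordinary least squares. Every row of X sums to zero, so the columns are
# linearly dependent; lstsq returns the minimum-norm solution in that case.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, value in zip(players, coef):
    print(f"{name:>14s} {value:+.3f}")
```

With a real season you would have thousands of rows and a column for every player in the league, but the structure is the same.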

This regression is meant to calculate each player’s contribution to the team’s bottom line (outscoring the opponent or being outscored) while accounting for his teammates and his opponents.  The main benefit is that APM should account for things that don’t appear in the boxscore; does player X set good screens, does he space the floor, does he close out on shooters, does he disrupt the opponents’ offense.  Defensive value is perhaps the best part of APM, since the only individual boxscore measures of defense are steals and blocks.  The second best part is the fact that regression is meant to account for the other variables, in this case teammates and opponents.  Sure, Kevin Love gets a lot of rebounds, but maybe it’s because his teammates force opponents into bad shots, and that’s where the value is?  Maybe he scores so much because defenses key in on Beasley?  In theory, APM gives a measure of a player’s value completely separated from other players in the league, regardless of how they might contribute.

That last sentence also summarizes the downside to APM.  One big problem is theoretical; APM is a black box.  The data goes in and the numbers come out, but we can’t say why they turn out the way they do.  If Kobe is above average, is it due to his scoring?  Is it his clutch ability?  APM can be separated for offense and defense, so there’s some value there, but if someone is an above-average defender you can’t say why.  With box score measures, you can point to where a player gets value and declare that to be why he is producing.

The other issue is a practical matter: players tend to play with the same guys over and over.  Starters are a good example; they are often on the court at the same time.  An extreme example from the same technique in hockey comes from the Sedin brothers; Daniel appears to share over 90% of his ice time with Henrik.  What this means is that those players (which are variables in the regression) are highly collinear: their values follow each other very closely across observations.  Players who play together a lot have virtually identical columns in the data (both are 1, or both are 0, in most rows), and thus the model cannot tell them apart.  This leads to two issues mathematically: unstable coefficients, meaning that players may be given incorrect APM scores, and large standard errors, meaning that we can’t be very certain about how good a player actually is.  The solution, practically speaking, is to add more data: if you include previous seasons to add more data points and gain some leverage from players being separated due to injuries and trades, the estimates become better.  Kobe Bryant is a good example.  His APM this year is -5.23 with an error of 6.86.  Taken at face value, Kobe is a very bad player (a score of 0 means that his team would play even on a neutral court if he were on the court and the other 9 players were equally matched to each other).  But we can’t be sure because the error is so big; we can only be somewhat sure that he’s somewhere between awful and slightly above average.  If you add in last year as well, though, he has a score of 4.06 with an error of 3.59.  Over the past season and so far this year, Kobe is a positive contributor, and we can be somewhat sure that he’s above average.  It also turns out, as described in Arturo’s post, that the APM regression does a very bad job of describing what happens on the court.  For whatever reason (noisy data or otherwise), the R squared is very low; you would not be terribly wrong if you just declared every player equally good.
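The collinearity problem is easy to reproduce on simulated data. The sketch below is a toy, not real basketball data: every number is invented, on-court status is coded 0/1 for simplicity, and the `ols_standard_errors` helper is a hypothetical name. Players A and B share the floor in roughly 95% of stints, while player C's minutes are independent of theirs; all three have the same true value.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficients and their standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)           # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 2000                                       # number of simulated stints

a = rng.integers(0, 2, n).astype(float)        # player A on court (1) or not (0)
flip = rng.random(n) < 0.05                    # B differs from A in ~5% of stints
b = np.where(flip, 1.0 - a, a)
c = rng.integers(0, 2, n).astype(float)        # player C plays independently

X = np.column_stack([a, b, c])
true_values = np.array([1.0, 1.0, 1.0])        # all three equally good
y = X @ true_values + rng.normal(0.0, 5.0, n)  # noisy stint outcomes

beta, se = ols_standard_errors(X, y)
for name, est, err in zip("ABC", beta, se):
    print(f"player {name}: estimate {est:+.2f}, standard error {err:.2f}")
```

In this setup the standard errors for A and B come out well above the one for C, even though all three appear in about the same number of stints. The stints where A and B are apart are the only rows that can distinguish them, which is why adding seasons with injuries and trades helps.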

A few methods have been suggested for dealing with these issues (beyond adding more seasons).  One is to try statistical plus-minus (SPM), which uses regression to predict APM from box score metrics.  The Rosenbaum link above does this as part 2 of his final APM measure, and Evan has done something similar with regularized APM and his model.  Since the boxscore tells us why someone is effective (e.g., we can see that they shoot a good percentage, or get a lot of steals), connecting that to APM can be informative.  Another option is the regularized APM (RAPM) I just mentioned, which fits the same model using ridge regression.  What this does in practice is shrink every player’s estimate toward average (0).  However, even with multiple years of data, RAPM is not as predictive as you might like.
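For readers who have not seen ridge regression, here is a minimal sketch of the shrinkage effect using the textbook closed form, (X'X + λI)⁻¹X'y. The data are simulated, the `ridge` helper and the penalty values are arbitrary choices for illustration, and this is not a claim about how any published RAPM model picks its penalty.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(1)

# Simulated stints: each of 6 players is home (+1), away (-1), or off (0).
X = rng.choice([-1.0, 0.0, 1.0], size=(300, 6))
true_values = np.array([3.0, -2.0, 1.0, 0.0, -1.0, 2.0])
y = X @ true_values + rng.normal(0.0, 4.0, 300)

# lam = 0 is ordinary least squares; larger lam pulls estimates toward 0.
for lam in (0.0, 50.0, 500.0):
    print(f"lambda = {lam:5.0f}:", np.round(ridge(X, y, lam), 2))
```

The penalty trades variance for bias: estimates stop swinging wildly, but a player who really is far from average gets pulled toward 0 along with everyone else.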

In summary, APM is a statistic that has great promise but big practical issues.  These issues have not gone unnoticed; beyond Arturo and Dave Berri’s posts, some people at the APBR site have been very cautious about its use (including RAPM).  But other people are not; it’s used as the basis for various SPM models and the same approach is used to analyze rebounding.  This leads me to the bleg portion of the post, aimed mostly at people who do use APM:  in short, why keep using it?  The one-year results, even for RAPM, are so noisy as to be unusable.  It has very little predictive power; the people you think are good this year could be great, terrible, or anywhere in between next year.  Despite the noise, some people use it to evaluate their own model or build new ones; why rely on something so unreliable to determine your model?  Has anyone attempted to see if APM becomes more predictive with more non-overlapping years?  For example, if you create 2-year APM from 07-08 and 08-09 and use it to predict the 2-year APM from 09-10 and 10-11, how well does that turn out?  Comparing APM and boxscore metrics is common in evaluating a player and my sense is that APM is given the benefit of the doubt.  For example, a player who scores highly on APM but not WP or WS or whatever *must* be a good defender or spread the floor; rarely is it assumed that his score is a mistake (unless he’s perceived to be good but scores poorly, like Kobe this year).  If you only use multiple-year APM, how do you know who was good just last year, or this year so far?  Weighting seasons is meant to cover that issue, but I bet it does little to improve the errors.
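The non-overlapping test asked about above comes down to a correlation between two sets of ratings. A minimal sketch, assuming each 2-year APM has already been computed and stored as a dict from player name to rating; the `period_to_period_r` helper, the player labels, and the ratings are all made up for illustration.

```python
import numpy as np

def period_to_period_r(first, second):
    """Pearson r between two rating dicts, over players present in both."""
    shared = sorted(set(first) & set(second))
    a = np.array([first[p] for p in shared], dtype=float)
    b = np.array([second[p] for p in shared], dtype=float)
    return np.corrcoef(a, b)[0, 1], len(shared)

# Invented ratings for two non-overlapping 2-year windows.
apm_0708_0809 = {"Player A": 6.1, "Player B": -2.3, "Player C": 0.4,
                 "Player D": 3.8, "Player E": -4.0}
apm_0910_1011 = {"Player A": 2.2, "Player B": -3.1, "Player C": 1.9,
                 "Player D": -0.5, "Player F": 5.0}

r, n_shared = period_to_period_r(apm_0708_0809, apm_0910_1011)
print(f"r = {r:.2f} over {n_shared} shared players")
```

Note that restricting the sample (say, to each window's top 50) compresses the range of ratings and will tend to lower the correlation, so the full league is the fairer test.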

So help me out guys: why use it?

(A quick P.S.: I understand the boxscore metrics all have their own drawbacks.  I know why people use them, though; I’m less clear on why people continue to use variations of APM or its method.)


33 Responses to An APM Primer/Bleg

1. EvanZ says:

Alex, there were some interesting papers that came out of Sloan this year. Three in particular deal with player metrics and are attempts to improve APM or provide an alternative. You should check these out, the pdf’s are available now (Omidiran, Piette, & Engelmann are the three authors):

http://www.sloansportsconference.com/research-papers/2011-2/posters/

One reason I can think of that APM is useful is simply the fact that it is unbiased relative to our different box score metrics. We all know, for example, that WP differs from other metrics on valuing rebounds. Well, APM doesn’t really have a dog in that fight. It’s not part of the regression. Although, I will point out that one of the Sloan papers actually does attempt to simultaneously figure out the weights of box score stats during the regression step, which is an interesting approach. Unfortunately, no player data are presented to enable us to apply a smell test.

• Alex says:

Thanks for the link; I’ll be sure to take a look.

The unbiased feature is nice, but it’s hard to evaluate. If WP says one player is good while ezPM or WS says he isn’t, and APM agrees with WP, does that mean that ezPM is wrong? If ezPM says a player is good and APM agrees but WP disagrees, does that mean WP is wrong? With the noise/interpretation issues even after adding seasons and regularization, how could you be sure it wasn’t chance? I believe someone at APBR correlated some of the boxscore metrics with APM at some point and they were all in the same ballpark; does that mean that they’re all fairly right? I usually just don’t know how to feel when APM is used as the basis of a model or claim about some player. Even when it agrees with something that seems obvious, it feels like luck.

• EvanZ says:

I think every box score model is going to roughly correlate with APM at the league-wide level, but each will have “exceptions”. It’s interesting to look at these exceptions, because it can give you an idea about what the box score metric might be missing.

For example, right now the box score metrics (including mine) have Ekpe Udoh rated as one of the worst players in the league. But anyone who watches him knows he has had a tremendously positive impact on GSW with his help defense. It’s just not showing up in the box score, and because his rebounding is way below average (probably because he is helping to deter shots, which leaves others to grab the boards), he looks like a scrub. When we look at RAPM (or 2-yr APM), we see a very positive value, suggesting that there is some disconnect between his actual productivity and the box score.

It’s important, because players like this are exactly the type of player that I might want to sign as a GM, since they will be extremely undervalued (sort of the inverse of high PPG players). APM can be a tool to help you identify these players. It doesn’t mean you should use it blindly, of course.

One last thing…you know or have been told the old saying, “all models are wrong, but some less so”. I’m pretty much done with criticizing WP. I don’t really have a dog in the fight. People will use it however they want to. I see with my own metric its strengths and flaws, and would hope that others would see the same. It’s one tool in the toolbox. APM is another tool. Your eyes and your brain are other tools. These are not mutually exclusive.

• Alex says:

That’s exactly the kind of thinking I’m confused by. How do you know that Udoh’s APM/RAPM is reliable? Why can’t I make the argument that all the box score metrics say that Monta is a bit below average but decent enough, but his 2-year APM is terrible, and thus he must do something god-awful that isn’t in the boxscore? (not to pick on Monta, he was just the first person to fit the bill). Is that true, as you say it’s true that Udoh must do something well that isn’t in the boxscore? Is it true in every case? If there’s a disconnect between the box score and APM/RAPM, is there actually a disconnect or is it noise in APM? When people disagree with WP (for example), it’s easy to see why (Kevin Love rebounds a ton, although he apparently is not all that great according to conventional wisdom). When APM does something funny, there’s no way to say why. And people seem to interpret it as correct knowledge, and then ascribe any differences to something not in the boxscore instead of a flaw with APM.

That’s the same question I was trying to ask with the WP/WS/ezPM ‘debate’. When they disagree, how useful is APM in informing the discussion? I would ask the same question, to a lesser degree, with watching the games. You watch a lot of Warriors games, so your opinion on Udoh is probably fairly accurate. But we know there are a ton of well-informed people out there who think that 90% of the game is putting the ball in the basket, which we also know is false. So simply watching the games isn’t sufficient most of the time either. I am completely on board with using multiple tools, but not if those tools are unsalvageable. What’s the saving grace for APM?

• EvanZ says:

“How do you know that Udoh’s APM/RAPM is reliable? ”

I don’t “know” it’s reliable, but I have a stronger belief in it, because it is consistent with my prior knowledge of how he affects the games.

• Alex says:

I meant that honestly. I’m saying that I’m unclear on how APM is useful given its unreliability. You’ve said that you believe in APM results when they agree with you (as in the case of Udoh), in which case you don’t really need APM, and that you think APM can be used to find underrated players, which I interpreted as you saying that you think APM values can be used. But you didn’t say, I don’t think, why you think they can/should be used. Do you think that the method actually is reliable? If you were a GM, you would be unconcerned that the regression is poorly formed and has huge errors? Why?

• EvanZ says:

“Why can’t I make the argument that all the box score metrics say that Monta is a bit below average but decent enough, but his 2-year APM is terrible, and thus he must do something god-awful that isn’t in the boxscore?”

Yes, many people do make this argument, and it has some merit.

• Alex says:

Soooo…. does that mean there’s some merit to the argument that Kobe is doing something god-awful that isn’t in the boxscore this year? Also keep in mind the extent of the disconnect; Monta is just below average by many metrics (above average according to PER); Chauncey Billups is above average. Yet APM says they are the 12th and 13th-worst players in the league over the past two years.

Beyond coming up with questionable examples, we still haven’t answered the question of why such arguments have merit. Why? Where does this merit come from?

2. Greyberger says:

On-court off-court data derived from play-by-plays is important because it demonstrates how crude our current understanding of basketball statistics still is and how misguided it is to think you can capture player contributions in box scores.

Kevin Love gets 17 rebounds per 40 minutes, while the average big man gets about 10. So he’s worth an extra 7 rebounds per 40 minutes, right? Nowhere close, thanks to interaction effects and the opportunity conundrum.

When Love goes into the game, one of his goals is to get as many rebounds as possible. He ends up taking away rebounds from the opponent and his teammates, and his presence on the floor changes rebounding technique and strategy.

If all you had to go by was counting stats (box scores), you’d get the curious impression that the other T-Wolves got much worse at defensive rebounding when Love checks in. But more than anything it’s a change in their opportunities and the team’s strategy. To get any deeper into these questions you need on-court off-court data as well.

The problem with combining counting stats and on-court off-court stats is that they don’t match up and refuse to reconcile, and that’s why our fan understanding of basketball stats is still so primitive. The expectation for Miami this year was, they’re replacing inefficient possessions with extremely efficient Lebron and Bosh possessions. So much attention would be directed towards stopping it that the others would score more easily.

But of course we don’t have any idea what happens when you assemble a team like that. Box score schemes and APM guesswork completely failed to predict the outcome or the reasons. We need to stop pretending like we’ve solved anything and use all the information available to us before the advanced stats backlash is truly underway and the entire conversation becomes toxic.

• Alex says:

Better information will always be great. I don’t mean to say that boxscore metrics are the be-all and end-all of analysis; certainly we need to do better than single-number work (like WS or WP or PER) alone. But I missed the part where you said why APM overcomes its huge drawbacks and informs the discussion.

• EvanZ says:

“But I missed the part where you said why APM overcomes its huge drawbacks and informs the discussion.”

Clearly.

• Greyberger says:

I’m trying to establish why we need plus-minus, since I thought that was your question. Adjusted plus minus is just a way to make sense of plus-minus systematically. There are options now when it comes to APM and they each have their limitations – I wouldn’t call it drawbacks.

The sentiment I see a lot is, it’s all just noise. That looking at team vs. team performance when a certain player or combination of players is in the game is pointless; you’ll never be able to tell who’s contributing what or if it’s real or just the ups and downs of stints and seasons.

If a two-SE confidence interval is your idea of rising above the noise, then in 2-year APM all the ‘Top Players’ at the front page of BValue.com qualify. That’s what APM is for – a systematic approach to untangling player contributions so that we’re not just winging it. If you already think this kind of thing is pointless and prefer box-score metrics it’s hard to understand the appeal.

But my point above that you didn’t respond to is the one I cared about making: Box score stats and metrics derived from them do not do a good job of describing reality. One player takes the shot, but it’s the team offense that creates the opportunity and informs the results. Rebounds are credited to individuals, but it’s the team result that matters, and so a point guard that plays on a 76% defensive rebounding team can’t be called bad at rebounding based on “his” box-score numbers. Assists don’t even begin to capture the interaction between players on offense, and so on.

And don’t get me started on the box scores and a player’s defensive performance. There’s a lot of unfounded confidence in box score components and metrics because they’re familiar and you can explain them in a sentence. It’s refreshing to read authors like Hollinger and Oliver who say box-score mashups are just estimates with their own limitations and acknowledge them.

• Alex says:

I’m aware of the box score drawbacks (as I said at the end of my post). But even then they have an advantage over APM-type analysis, which is transparency. If a result looks funny you can figure out why. That is impossible in an APM system, and still difficult in a SPM system because the connection between APM and the boxscore is tenuous. If you don’t like how something values rebounding, or assists, you can tweak it. With APM that isn’t an option at all, and I see no reason to have confidence that it values anything (including players overall) correctly.

If by the frontpage of bballvalue.com you mean the top 10 players, then I suppose that’s true. It literally stops being true at number 11 though (Andre Miller, 6.39/3.25). And that isn’t entirely the point; if I need to know whether to pay Miller or Chris Paul, it doesn’t exactly matter that Paul is 2 SE above average; I want to know that he’s better than Miller. Paul comes out at 3 points better than Miller, but they both have errors of over 3 points. It doesn’t inspire confidence, and it only gets worse if you try to compare players that aren’t at the top of the board. Beyond that, I don’t even know if I should trust those numbers. Is Keyon Dooling really a top-20 player? Is Billups a bottom-20? Box score metrics may not describe reality, but why should I think any flavor of APM does?

3. EvanZ says:

Alex, you said:
“I’m saying that I’m unclear on how APM is useful given its unreliability. ”

I’m not sure which version of APM you are referring to. I don’t rely too heavily on the APM values from basketball-value at this point, and as you probably know, prefer to use the 3.x yr RAPM as my reference. I’ve shown that the R^2 between ezPM and RAPM over 3 seasons is about 0.5, which I think we can agree is fairly high. I further assume (but will leave you to test) that WP has a similar R^2 (maybe a little worse, maybe better, who knows). You might want to do a post similar to the one you did before using RAPM to predict or retrodict wins. Maybe it’s more reliable than you give it credit for. I plan to do such a test when I get time, but that might not be until the summer.

We both agree that 1-yr APM y-t-y correlation is low. But it’s not just us. Nobody who uses or calculates APM would disagree with that statement.

• Alex says:

‘Nobody’ is pretty strong; I see a fair number of people make use of 1-year APM. But you’re right that a retrodiction study would be nice.

• EvanZ says:

You’re right, too many people use 1-yr APM without really knowing how it’s calculated or the large amount of error. What I meant to say is no “expert” who uses or calculates APM would disagree. But then, I guess that’s my definition of “expert”.

• Alex says:

Seems fair, although I’m sure many people are of the opinion that no ‘expert’ uses WP 😉

I’ve never been able to find any hard numbers; do you know of work presenting the R squared or predictive ability of some of the other RAPM or multi-year APM models? I read the Sloan paper you mentioned and it sounded like 3-year RAPM didn’t do much better than 1-year in terms of predictions. It wouldn’t take much to improve on the .05 that Arturo reported for 1-year APM, but I’m curious how much better things really get. For example, I wonder what the R squared is for Ilardi’s 6-year APM. The errors there are greatly reduced (although still not fantastic, depending on how many above-average players you think exist), but I’m curious what the model fit is like.

4. Crow says:

“Has anyone attempted to see if APM becomes more predictive with more non-overlapping years? For example, if you create 2-year APM from 07-08 and 08-09 and used it to predict the 2-year APM from 09-10 and 10-11, how well does that turn out?”

I also proposed this awhile ago (maybe 6 months or a year ago, I don’t recall the date specifically) but at that time it really couldn’t be done from public sources for a full 4 year time span and 2 independent sets of 2 year APM. It still can’t from RAPM, but it can with traditional 2 year APM.

I compiled a sample of 67 players who appeared on either the top 50 for 2 year APM for 2007-8 & 2008-9 or the top 50 for 2 year APM for 2009-10 & 2010-11 to date. It is a biased, incomplete sample, but that is the extent of data compilation I was willing to do right now.

The correlation (r) of these 2 year APMs was… 0.0019.

Players change because of aging, team, role, injury, etc.
But this is still a result to think about. Hard.

Would a full league comparison be better? I don’t know.

Could the best RAPM with SPM and pairs incorporated in some fashion and other noise reductions do better? I’d guess- sure.

Could they move from this extremely low result to even a pretty good one? I am listening to my skeptical side right now. Hard.

• Alex says:

Thanks Crow. I would hope the correlation would go up with a full sample, since hopefully 2-year APM can at least separate the top 50 players from the bottom 50. But for the top players to completely shuffle like that does not inspire confidence.

5. Crow says:

That 2 year traditional APM appears to not be stable between independent time periods is one thing.

Whether existing 4 year RAPM (or a more advanced version) is useful for assessing the rough placement of a player on a scale between awful and great is another topic I won’t add further comment on right now.

• Alex says:

Agreed, although I think the real litmus test would still be if 4 year RAPM (or what have you) can predict how a player will do next year. Usefulness is priority number 1; deciding who should have actually won the MVP three years ago is farther down the list.

6. Crow says:

Yeah, maybe the biased sample from the top 50 is a major problem.

Whether 4 year RAPM can predict next year is certainly interesting / would be useful; but a rough historic quality measure has some value too.

7. Crow says:

A sample of 45 players in the west over a 1200 minute threshold had a correlation of .38. A very big improvement but still modest.

8. Crow says:

That was for the 2 yr to 2yr APM comparison.

Just doing those in the west for the full period shouldn’t be a problem, I wouldn’t think.

To me, at this point, 4 year RAPM is the best version and best used as a rough historic quality measure only and as a provoker of further research when its value varies widely from boxscore metrics.

• Alex says:

Seems like a move in the right direction. Thanks for the analysis as well, although then we get into another favorite issue: how consistent should we expect players to be anyway? I don’t know if that’s even an answerable question, since you have to pick a metric to check in the first place.

• Greyberger says:

If you think about the components of total-value metrics (scoring, rebounds, assists, etc) some are definitely more consistent than others. If you want to know if a player is a good three point shooter or not, one year worth of data is probably not enough to reach a (justified) conclusion. So I agree it is kinda an open question whether one year’s worth of information is enough to compare time period to time period.

• Crow says:

Ryan Parker did good work at his basketballgeek.com website taking season level boxscore data and constructing multi-season based confidence ranges about true skill levels.

Because as you suggest boxscore data is from a sample and boxscore based metrics are also subject to error in estimating true skill (if you believe in such a concept), though they never come with that admission and often come with the posture they are 100% accurate.

• Crow says:

But to balance that last statement out a bit, some components of boxscore based metrics are indeed more consistent than others, and as a roll-up, boxscore metrics are generally more consistent than APM; consistency at each level is notable and useful.

But the consistency of the winning impact of these individual discrete stats for a season is not necessarily exactly the same season to season, game to game within season or player to player for either time period. Game situation affects the true win value of every action with some variation which is notable.

All season based metrics are merely approximations of game level winning impact. Including APM. This is a significant issue in my mind and I am planning to think about it some and hope to try to address it in the future and hope to see other efforts to do so as well.

• Crow says:

Wayne Winston enhanced his Adjusted +/- work with Impact ratings that account for at least clutch late game activity and its impact on actually winning that particular game.

In early work at ProTrade, game situation was considered in a boxscore based model every step of the way.

9. Crow says:

“how consistent should we expect players to be anyway?”

Yes. Assuming players “should be” or “are” stable in overall impact is an assumption and a debatable one.

So “if” players legitimately change in overall impact then metric consistency is not the right standard for metric appraisal.

10. Crow says:

“One big problem is theoretical; APM is a black box. The data goes in and the numbers come out, but we can’t say why they turn out the way they do.”

This is the reason to do Adjusted +/- at the 4 Factors level.

Yes it is still subject to error, but it will give signals about where the positive and negative impacts are likely coming from, especially for big minute players and especially if you are willing to compute / use multi-season RAPM Factors. That information about “where” can be used to adjust lineups to attempt to correct the main problem(s) and thru trial you could see if it works.