Predictive Stats, Bad Metrics & Correlation in the NBA

Posted on 07/20/2010 by


In my work life, I prepare metrics for a living. I spend a lot of time sifting through data and trying to create effective and sustainable models to turn it into information. This information is then used to run large manufacturing facilities that make products that save people’s lives. Creating a metric or a model which is inexact or misleading can have dire consequences, both financially and for public health. So it is paramount that I create models & metrics that are robust and explain the variability that I am trying to manage.

Bad metrics that mislead me and my company are a grave problem. My continued employment and success speak to the fact that I’ve managed to successfully identify and segregate the good information from the noise. I achieve this through the use of correlation and the scientific method. Simply put, when I have a set of variables (Y’s) I am trying to manage (Cost, Output, Yield, OEE, Mean Time Between Failures, etc.), I map my process, figure out the x’s (Man, Machine, Method, Environment, Material) and then either use existing data to figure out the x’s that matter, create a measuring system to collect the data and then figure out the x’s, or build an experiment to test it out. Rinse and repeat.

Six Sigma, that's how we roll.

It is by this method that we come up with the metrics that allow us to stay in business and succeed. So it is no wonder that I look at the majority of stats for performance in the NBA in dumbfounded amazement.

What I hope to do in this post is take a look at the data regularly collected for NBA teams for one season (2009-2010) and submit it to the same kind of rigor I would in my work life, to see which x’s (stats) I should be looking at if I want to accurately predict my Y (Wins). For the purposes of this discussion (and for your own amusement, blogging and frankly whatever else you crazy kids do with statistical data), I’ve put together all the stats for the 2009-2010 season in one convenient spreadsheet using data from Andres’ fancy site & Basketball-Reference.com. Using this data I will be able to calculate the correlation of each of my x’s versus wins. Let’s get started.
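For anyone who wants to follow along outside of Excel or Minitab, here is a minimal sketch of the correlation step in plain Python. The function names and the toy numbers are my own illustration (they are made up, not taken from the spreadsheet); the idea is simply to compute the Pearson correlation of each stat column against wins and rank the columns by r squared.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def rank_stats_by_r_squared(columns, wins):
    """Return (stat_name, r, r_squared) tuples, strongest relationship first."""
    rows = []
    for name, values in columns.items():
        r = pearson_r(values, wins)
        rows.append((name, r, r * r))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up toy numbers for illustration only, NOT real NBA data.
toy_wins = [10, 20, 30, 40]
toy_columns = {
    "stat_a": [1, 2, 3, 4],  # perfectly linear with wins
    "stat_b": [4, 1, 3, 2],  # weak, negative relationship
}

for name, r, r_sq in rank_stats_by_r_squared(toy_columns, toy_wins):
    print(f"{name}: r = {r:+.2f}, r^2 = {r_sq:.0%}")
```

With the real spreadsheet, each column would have 30 values (one per team) and the wins column would be the teams' 2009-2010 win totals.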

Go to xkcd for more

I’m going to be looking at three sets of stats: NBA boxscore stats for teams, NBA boxscore stats for opponents & predictive stats. Boxscore stats I won’t explain, but for the predictive stats here’s what I included (Warning: Math Content):

  • Hollinger’s Player Efficiency Rating (PER), used extensively by ESPN to rank players (explained here; CAUTION: you will need to have a trained math person close by if you have questions). PER is calculated:
      uPER = (1 / MP) *
           [ 3P
           + (2/3) * AST
           + (2 - factor * (team_AST / team_FG)) * FG
           + (FT * 0.5 * (1 + (1 - (team_AST / team_FG)) + (2/3) * (team_AST / team_FG)))
           - VOP * TOV
           - VOP * DRB% * (FGA - FG)
           - VOP * 0.44 * (0.44 + (0.56 * DRB%)) * (FTA - FT)
           + VOP * (1 - DRB%) * (TRB - ORB)
           + VOP * DRB% * ORB
           + VOP * STL
           + VOP * DRB% * BLK
           - PF * ((lg_FT / lg_PF) - 0.44 * (lg_FTA / lg_PF) * VOP) ]

Where

 factor = (2 / 3) - (0.5 * (lg_AST / lg_FG)) / (2 * (lg_FG / lg_FT))
 VOP    = lg_PTS / (lg_FGA - lg_ORB + lg_TOV + 0.44 * lg_FTA)
 DRB%   = (lg_TRB - lg_ORB) / lg_TRB
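If the formula is easier to read as code, here is a rough Python transcription of uPER and its three league constants, term for term. The dictionary layout (`p` for player totals, `team` for team totals, `lg` for league totals) and the function names are my own assumptions for the sketch. This is only the unadjusted uPER as written here; it does not include any pace or league-average adjustment.

```python
def league_constants(lg):
    """Compute factor, VOP and DRB% from a dict of league totals."""
    factor = (2 / 3) - (0.5 * (lg["AST"] / lg["FG"])) / (2 * (lg["FG"] / lg["FT"]))
    vop = lg["PTS"] / (lg["FGA"] - lg["ORB"] + lg["TOV"] + 0.44 * lg["FTA"])
    drb_pct = (lg["TRB"] - lg["ORB"]) / lg["TRB"]
    return factor, vop, drb_pct


def uper(p, team, lg):
    """Unadjusted PER for one player, following the formula term by term.

    p    -- player totals (keys: MP, 3P, AST, FG, FGA, FT, FTA, TOV,
            TRB, ORB, STL, BLK, PF)
    team -- team totals (keys: AST, FG)
    lg   -- league totals (keys: AST, FG, FT, FTA, PTS, FGA, ORB, TOV,
            TRB, PF)
    """
    factor, vop, drb_pct = league_constants(lg)
    ast_ratio = team["AST"] / team["FG"]  # share of team field goals that were assisted
    total = (
        p["3P"]
        + (2 / 3) * p["AST"]
        + (2 - factor * ast_ratio) * p["FG"]
        + p["FT"] * 0.5 * (1 + (1 - ast_ratio) + (2 / 3) * ast_ratio)
        - vop * p["TOV"]
        - vop * drb_pct * (p["FGA"] - p["FG"])
        - vop * 0.44 * (0.44 + 0.56 * drb_pct) * (p["FTA"] - p["FT"])
        + vop * (1 - drb_pct) * (p["TRB"] - p["ORB"])
        + vop * drb_pct * p["ORB"]
        + vop * p["STL"]
        + vop * drb_pct * p["BLK"]
        - p["PF"] * ((lg["FT"] / lg["PF"]) - 0.44 * (lg["FTA"] / lg["PF"]) * vop)
    )
    return total / p["MP"]
```

Thirteen terms, three league constants and a team ratio, all before any adjustment is applied.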

Got it? Let’s move on.

  • The second number is NBA Efficiency, used by the NBA itself. The calculation for this one is:

NBA Efficiency = (Points + Rebounds + Assists + Steals + Blocks) – ((Field Goals Att. – Field Goals Made) + (Free Throws Att. – Free Throws Made) + Turnovers)

A little simpler, no?
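The NBA Efficiency calculation fits in a few lines of Python. The function and parameter names below are my own shorthand for the stats in the formula, so treat this as a sketch rather than anything official.

```python
def nba_efficiency(pts, reb, ast, stl, blk, fga, fgm, fta, ftm, tov):
    """NBA Efficiency: the good stuff minus missed shots and turnovers."""
    positives = pts + reb + ast + stl + blk
    negatives = (fga - fgm) + (fta - ftm) + tov  # missed FGs + missed FTs + turnovers
    return positives - negatives


# Made-up stat line for illustration: 20 pts, 10 reb, 5 ast, 2 stl, 1 blk,
# 8-of-15 from the field, 4-of-6 from the line, 3 turnovers.
print(nba_efficiency(20, 10, 5, 2, 1, fga=15, fgm=8, fta=6, ftm=4, tov=3))
```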

  • Third is Win Score (WS) and its derivatives: WS/min, Position Adjusted WS/min (PAWS/min), predicted Wins Produced per 48 minutes (WP48) and predicted Wins Produced from Win Score.

Win Score = PTS + REB + STL + ½*BLK + ½*AST – FGA – ½*FTA – TO – ½*PF

Win Score/min = WS/minutes played

Position Adjusted Win Score/min = WS/min – average Win Score/min for all players at that position

Predicted WP48 = PAWS/min * 1.617 + 0.100

Predicted Wins Produced = Predicted WP48 *Minutes Played/48
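The whole Win Score chain can be sketched in Python as a handful of one-line functions. The function names are my own, and the positional average passed to `paws_per_min` is something you would have to compute yourself from all players at that position; the numbers in the example are made up.

```python
def win_score(pts, reb, stl, blk, ast, fga, fta, to, pf):
    """Win Score from raw box score totals."""
    return pts + reb + stl + 0.5 * blk + 0.5 * ast - fga - 0.5 * fta - to - 0.5 * pf


def win_score_per_min(ws, minutes):
    """Win Score per minute played."""
    return ws / minutes


def paws_per_min(ws_per_min, position_avg_ws_per_min):
    """Position Adjusted Win Score per minute (PAWS/min)."""
    return ws_per_min - position_avg_ws_per_min


def predicted_wp48(paws_min):
    """Predicted WP48 from PAWS/min, using the coefficients quoted in this post."""
    return paws_min * 1.617 + 0.100


def predicted_wins_produced(wp48, minutes):
    """Predicted Wins Produced from predicted WP48 and minutes played."""
    return wp48 * minutes / 48


# Made-up season line for illustration only.
ws = win_score(pts=1200, reb=600, stl=80, blk=40, ast=300,
               fga=950, fta=300, to=150, pf=200)
ws_min = win_score_per_min(ws, minutes=2400)
paws = paws_per_min(ws_min, position_avg_ws_per_min=0.2)  # 0.2 is a placeholder average
wp48 = predicted_wp48(paws)
print(ws, round(wp48, 3), round(predicted_wins_produced(wp48, 2400), 1))
```

Note that a player who is exactly average for his position (PAWS/min of zero) comes out at a predicted WP48 of 0.100.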

  • Finally, there are Wins Produced and Wins Produced per 48 minutes (WP48) (explained here)

So now that we’ve explained everything let’s look at some tables & results:
It is interesting to note that the top three non-predictive (box score) statistics by predictive power (Opponent Assists, Opponent Points and Opponent Field Goals Made) are all defensive. NBA Efficiency is not much better than those three at 50%. Win Score and its derivatives and PER come in a virtual tie for second place with about 70% correlation. Wins Produced stands alone in first with a 94.9% correlation to wins. Despite all the fancy math we saw, of all of these statistics only one was developed using correlation (no prizes for guessing which one).

Readers of this blog will also note that we came up with a few metrics here (see this article)

  • Using just a player’s box score stats (85% correlation over eight seasons):
W = 84.0 + 0.0445 FG – 0.0583 FGA + 0.0550 3P – 0.00866 3PA + 0.0176 FT – 0.0170 FTA + 0.0635 ORB + 0.0555 DRB + 0.0118 AST + 0.0683 STL + 0.0112 BLK – 0.0620 TOV + 0.00656 PF
  • Using just a player’s & opponent’s box score stats (94% correlation over eight seasons):
W = 64.6 + 0.0743 FG – 0.0307 FGA + 0.0194 3P + 0.00513 3PA + 0.0397 FT
– 0.0155 FTA + 0.0364 ORB + 0.00278 DRB + 0.00332 AST + 0.0100 STL
+ 0.00308 BLK – 0.0169 TOV – 0.00484 PF – 0.0605 OppFGM + 0.0113 OppFGA
– 0.0275 OppFTM + 0.00692 OppFTA – 0.0378 Opp3PM + 0.00461 Opp3PA
– 0.0105 OppORB + 0.0135 OppDRB + 0.00032 OppAsst + 0.00181 OppSTL
– 0.00751 OppBlk + 0.00544 OppTOV + 0.00214 OppPF
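Both of these are plain linear models, so applying one is just an intercept plus a sum of coefficient times stat. Here is a hedged Python sketch for the first (box score only) model, with the coefficients copied from the equation above. The names `BOX_SCORE_INTERCEPT`, `BOX_SCORE_COEFFICIENTS` and `predict_wins` are my own, and the inputs have to be in the same units the regression was fit on. The second model works the same way with the longer coefficient list.

```python
# Coefficients copied from the first model above (box score stats only).
BOX_SCORE_INTERCEPT = 84.0
BOX_SCORE_COEFFICIENTS = {
    "FG": 0.0445,
    "FGA": -0.0583,
    "3P": 0.0550,
    "3PA": -0.00866,
    "FT": 0.0176,
    "FTA": -0.0170,
    "ORB": 0.0635,
    "DRB": 0.0555,
    "AST": 0.0118,
    "STL": 0.0683,
    "BLK": 0.0112,
    "TOV": -0.0620,
    "PF": 0.00656,
}


def predict_wins(stats, intercept=BOX_SCORE_INTERCEPT,
                 coefficients=BOX_SCORE_COEFFICIENTS):
    """Evaluate W = intercept + sum(coefficient * stat) for one set of stats.

    Raises KeyError if a stat the model needs is missing, rather than
    silently treating it as zero.
    """
    return intercept + sum(coef * stats[name] for name, coef in coefficients.items())
```

Passing a different `intercept` and `coefficients` dictionary lets the same function evaluate the second model, or any regression of your own.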

So to recap:

  • Wins Produced is clearly the best model of the ones evaluated.
  • A guy with a degree, a blog for a hobby, Excel & Minitab and a free afternoon can develop a productivity metric that correlates with wins much more strongly than both the one developed by the leading basketball stat geek for the most influential font of sports information out there and the NBA’s preferred statistic.

Some might ask, why is this a big deal? Teams in the NBA, media and fans are making and evaluating multimillion dollar decisions based on these bad statistics (and in some cases overly complicated ones; yes, I’m looking at you, PER). People live and die by these teams, and we can statistically prove that their teams are being mismanaged. At the end of the day it may be just sports, but bad statistics should offend everyone.

Quick Note:

It’s been pointed out to me that the numbers for Win Score, PER & NBA Efficiency get progressively worse with a larger data set. For 1978 through 2010:

Metric       Correlation
Win Score    60%
PER          28%
NBAEFF       28%

Wins Produced remains at 95% for that data set as well.

Note #2:

Professor Berri notes in the Comments:

“Those results Arturo reports are correct (and there were not adjustments made to the data). One needs to remember that Arturo’s analysis is only based on one year. So his n is 30. This is a very small sample. The larger sample gets closer to what is going on.

One should also add that the original one-season result is misleading. PER is adjusted for pace. Win Score is not. At the player level, pace is not really important. But at the team level it matters. Win Score + Pace will explain more than PER for this past season as well.”

To that end here’s a version of the spreadsheet with pace adjusted Win Score.

The final numbers for 2009-2010 are:

Metric                     R-SQ
NBAeff                     50%
PER                        72%
Pace adjusted Win Score    80%
Wins Produced              95%
