Readers familiar with this blog know that I’ve been working on a model to predict success in the NBA using the Wins Produced metric (see the Basics here). In a sense, it’s the mission statement of this blog. The intent is to shake out the tools and build a model piece by piece, put it through its paces, rinse and repeat, and over time get closer to simulating the truth.
The development version is already up and released to the public for beta testing (see here), and the full pre-season build is coming (with endless refinements as the season goes along). But before I get to that I need to deal with one of my favorite topics: the draft and rookies.
Now, the draft is notoriously hard to model, and a simple answer would be to just use some dummy variables for rookies and carry on, but readers by now know that I never take the easy path. So the question becomes: how do we model rookies?
For this exercise, I went ahead and did a full build combining all the combine data from Draft Express (yes, all of it; I have been working on this for a while) with all the WP48 data for rookies. Then I took the data and started looking for variables that correlate with rookie-year Raw Productivity per 48 minutes (ADJP48). Please note that I said rookie year and not first 4 years; that is a slightly different model (and post :-)). I found the following variables that correlate in a meaningful way:
- Height
- Position
- Age when drafted
- Win Score per 40 minutes
The equation I came up with based on these variables is:
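ADJP48 = K - A*HEIGHT + B*SIMPOS - C*DFTAGE + D*WS40
where K, A, B, C and D are constants, SIMPOS is the simple position, DFTAGE is the age when drafted and WS40 is Win Score per 40 minutes.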
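For anyone who wants to play along at home, here is a minimal sketch of how an equation of this shape could be fit by ordinary least squares. It is an illustration and not the actual build: the function names, the units (height in inches, position coded 1 to 5), the made-up coefficients and the synthetic data are all assumptions, and none of the numbers are the real constants.

```python
import numpy as np

def fit_rookie_model(height, simpos, dftage, ws40, adjp48):
    """Least-squares fit. Returns (K, A, B, C, D) in the sign convention
    ADJP48 = K - A*HEIGHT + B*SIMPOS - C*DFTAGE + D*WS40."""
    height, simpos, dftage, ws40, adjp48 = (
        np.asarray(v, dtype=float) for v in (height, simpos, dftage, ws40, adjp48)
    )
    # Design matrix: intercept column plus the four predictors.
    X = np.column_stack([np.ones_like(height), height, simpos, dftage, ws40])
    coef, *_ = np.linalg.lstsq(X, adjp48, rcond=None)
    k, h, p, a, w = coef
    # Flip the signs on height and age so they match the equation's minus signs.
    return float(k), float(-h), float(p), float(-a), float(w)

def predict_adjp48(params, height, simpos, dftage, ws40):
    """Apply the equation with a given (K, A, B, C, D)."""
    k, a, b, c, d = params
    return (k
            - a * np.asarray(height, dtype=float)
            + b * np.asarray(simpos, dtype=float)
            - c * np.asarray(dftage, dtype=float)
            + d * np.asarray(ws40, dtype=float))

# Synthetic stand-in data (NOT real draft data), sized like the 373-player sample.
rng = np.random.default_rng(0)
n = 373
height = rng.normal(79.0, 3.5, n)             # assumed: inches
simpos = rng.integers(1, 6, n).astype(float)  # assumed: 1 = PG ... 5 = C
dftage = rng.uniform(18.5, 23.5, n)           # assumed: years
ws40 = rng.normal(9.0, 3.0, n)
made_up = (0.5, 0.004, 0.03, 0.01, 0.02)      # placeholder constants
adjp48 = predict_adjp48(made_up, height, simpos, dftage, ws40) + rng.normal(0.0, 0.05, n)

fitted = fit_rookie_model(height, simpos, dftage, ws40, adjp48)
r = np.corrcoef(predict_adjp48(fitted, height, simpos, dftage, ws40), adjp48)[0, 1]
print("fitted K, A, B, C, D:", np.round(fitted, 4))
print("correlation between predicted and actual:", round(float(r), 2))
```

Returning the height and age coefficients with their signs flipped keeps the output readable against the equation as written: a positive A means taller players are marked down, and a positive C means older players are.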
The correlation is 42% across every player coming from college who played more than 400 minutes as a rookie (from 1996 to 2010 that’s 373 players). In graph form it looks something like this:
The full table is here. But what does it actually mean? When I look at the error by Age and Position I see the following:
The model is consistent, and it’ll allow me to look at a player and predict within reason who they’re going to be. I only care about one side of the tail: if my model oversells a player (a false positive) it costs me money; if it undersells him (a false negative) it’s money in my pocket. Given that, the model is better than the straight correlation indicates.
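To make the one-sided point concrete, here is a small sketch that splits prediction errors into oversells and undersells instead of lumping them into a single correlation. The function name and the dictionary keys are hypothetical, chosen only for this illustration.

```python
import numpy as np

def one_sided_error(predicted, actual):
    """Split prediction errors into oversells (model too high) and
    undersells (model too low) and summarize each side separately."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    residual = actual - predicted
    oversold = residual < 0   # player produced less than predicted
    undersold = residual > 0  # player produced more than predicted
    return {
        "oversold_count": int(oversold.sum()),
        "undersold_count": int(undersold.sum()),
        # Average size of the miss on each side (0.0 if that side is empty).
        "mean_oversell": float(-residual[oversold].mean()) if oversold.any() else 0.0,
        "mean_undersell": float(residual[undersold].mean()) if undersold.any() else 0.0,
    }
```

If the oversell side is small and rare, the model is safe to draft with even when the overall correlation looks modest, because the undersells are pleasant surprises.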
Let’s illustrate. Here are the best-ranked rookies who actually played from 1997 through 2006 (the last ten-year period where the draftees have at least 4 years of data):
If I consider a hit to be drafting a player who is at least a career .090 WP48 player, then the model hit 36 of 50 times, or 72%. So if I have multiple picks in a draft, I’m assured a decent player, and since the average pick for the group is 13, these players will be available late. As for the last few years, here are the recommended picks:
You’ll note that Blake Griffin isn’t in this group (hasn’t played yet) but overall the list is strong. Beasley is the turd in the punch bowl, but I would remind everyone that he’s only played two years in the league (and this might be, by his own admission, the first year he plays clean).
As for the misses?
Missing Lee and Odom hurts but it’ll have to do until we build a better college model.
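For reference, the hit rate quoted earlier is just a threshold count. A minimal sketch, assuming a plain list of career WP48 values for the recommended players (the function name is made up for this example):

```python
def hit_rate(career_wp48, threshold=0.090):
    """Count players whose career WP48 is at least the threshold.
    Returns (hits, total, hits / total)."""
    values = list(career_wp48)
    if not values:
        raise ValueError("need at least one player")
    hits = sum(1 for wp in values if wp >= threshold)
    return hits, len(values), hits / len(values)
```

With 36 hits out of 50 recommended players, the same arithmetic gives 36 / 50 = 0.72.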
So now that we have the model the next logical step is to project the incoming 2010 rookie class and I’ll do just that. Tomorrow. In part 2.
Alex
10/08/2010
Hey Arturo – It looks like you use simple position as a continuous variable here. Do you do better if you make it categorical? I assume the jumps in productivity aren’t the same from point to SG to SF to PF to center.
arturogalletti
10/08/2010
Very probably. Got to leave some improvement for the next version. I’ll play with running a by-position regression equation.
jglanton
10/08/2010
Arturo,
The first thing that came to mind when you used ‘height’ in the formula was to refine it to use ‘reach’. It might help remove some anomalies to separate the pterodactyls from the T-Rexes, as some of the pterodactyls overachieve for their height, and vice-versa.
arturogalletti
10/08/2010
We looked at reach as one of the variables and it didn’t really correlate strongly. The combine data is actually a big waste of time so far. The only questions that matter are:
Can you play?
What position?
Are you tall?
How old are you?
Everything else resembled noise. I will however revisit the combine data and the can you play question in the future.
Neal Frazier
10/09/2010
When looking at the age of the draftee, is the problem with younger players more that they aren’t mature enough to compete with men yet or is it that we haven’t seen them enough to figure out how good they will be yet? Not sure how you would tease this out in the numbers…
arturogalletti
10/09/2010
Actually, the model favors younger players. If you have two players with similar numbers, go younger.
Shawn Ryan
10/09/2010
Damn Arturo! I want to be just like you when I grow up!
arturogalletti
10/09/2010
Thanks. Just wait till the sequel! 🙂
Fred Bush
10/10/2010
So, height is bad? Am I misreading your equation or are you burying the lede?
arturogalletti
10/10/2010
It’s a combined effect. College performance is devalued by height and increases with youth. So the performance number is more likely to correlate if you’re shorter and younger. So a 19 year old 6’6” center who lit it up is more likely to have success. If you’re tall and old you have to dominate in college to dominate in the pros.
Fred Bush
10/10/2010
If that’s true, I’m going to guess that’s a highly exploitable flaw in teams’ valuations of players. I would assume that most teams think that, all things being equal, a taller player would be better. How much of the difference between actual draft position and your algorithm’s draft position is explained by that single variable being (-) rather than (+)?
arturogalletti
10/10/2010
Tough to tell, but it’s significant. I’ll run some numbers. The point is height should lead to production or it’s worthless.
Evanz
10/11/2010
I see Horford and Speights on the list. Was Noah a miss?
arturogalletti
10/11/2010
Yogi missed him, Boo Boo got him (see part 2)