For those of you who haven't been following along, let me lay out the problem briefly. I spent this summer discussing with friends how good the Warriors would be this season. I thought they would be good, probably the best team in the league, but a little worse than they were last year. Essentially, I figured they would be an average best team (the average winning percentage of the winningest team from each season is 77 percent, below the 82 percent of games the Warriors won last year). But they got off to a really hot start, so what should I think now?
For the purposes of an example, let's consider two extreme options. First, I could pay attention only to the sample of games I have observed this season. If I did that, I would assume that because they are 16-0, they will never lose again and will end with a winning percentage of 100 percent.
Or, I could be stubborn; nothing I have seen changes my mind. In this example they are still a 77 percent win team. But, they already have 16 wins in the bank. If I believe each game is independent, the 77 percent only kicks in from this point forward. This would mean I think they should win about 51 of their next 66 games. This would put them at 67 wins for the season, exactly where they were last year.
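The stubborn estimate is just arithmetic, and a quick sketch makes it easy to check. The variable names are mine, and the 82-game season length is the standard NBA schedule rather than something stated above.

```python
# Stubborn estimate: bank the observed wins, then apply the prior win rate
# to every remaining game.
wins_so_far = 16
losses_so_far = 0
games_in_season = 82      # standard NBA regular season (assumption)
prior_win_rate = 0.77     # average winning percentage of each season's best team

games_remaining = games_in_season - (wins_so_far + losses_so_far)
expected_future_wins = prior_win_rate * games_remaining
projected_total = wins_so_far + expected_future_wins

print(games_remaining, round(expected_future_wins), round(projected_total))
```

With 66 games left, this gives about 51 more wins and a projected total of about 67.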
In reality, the truth lies somewhere in between. Watching them go 16-0 suggests that they are better than the average best team, but I still don't think this pace is maintainable. So I suggested that we could use Bayes' Law to solve this problem.
The concept was correct, but my implementation had a number of flaws.
First, my calculations depended on fairly sparse historical data. I needed to calculate the probability of winning the first 13 games (I did this a week ago), conditional on winning 67 games. This concept would work fine in the range where we have lots of teams. For example, teams that win about 50 percent of their games are pretty common, so the probability of winning 6 of the first 13 games, for teams that ultimately win 41 games, would be pretty easy to calculate. However, once I am dealing with extremes, the data is sparse. This small sample size meant that weird things could happen (like the example of the Mavericks that I described last week). It was bound to break once the Warriors went 16-0; we have no data for this unprecedented case. This nonparametric approach was pretty fragile.
Second, my approach was computationally expensive. I only provided a simple example: the probability that the Warriors repeat their 67-win performance, based on their current record. If I wanted to calculate the probability that they win 70 games, that would be a separate calculation. And if I wanted to look at a 14-0 record, that would be yet another calculation. Imagine doing this for every win total and every possible record. It would have been doable, but it would have taken a lot of extra calculations.
So I did two things to solve this problem.
First, I got more data. I scraped the Basketball Reference season tables to get every game of each season from 1967-1968 until 2014-2015. This should make my data a little less sparse and let me work with larger sample sizes.
Second, I did some reading about conjugate priors. For some combinations of probability distributions, there are closed-form ways to calculate the updated posterior probabilities. This solves both flaws described above. First, by using probability distributions (a parametric approach), I am taking my data and drawing a curve through the data points that I have. This means I can look at the curve I drew even where I don't have data. Second, because there is an algebraic solution, I can find the equation of the new curve without having to calculate each point.
It turns out one of these conjugate priors is perfectly designed for this type of problem and really simple: the beta-binomial. A beta distribution has two parameters, alpha and beta. And a binomial distribution counts successes and failures, or in this case wins and losses. If you have a beta distribution for a prior and then you count wins and losses, your posterior is also a beta distribution. But instead of plugging alpha and beta into the equation, I can plug in alpha + wins and beta + losses. If I can just describe my prior with a beta distribution, I am good to go.
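The update rule can be sketched in a few lines of Python. The prior parameters here (7.7 and 2.3, which amount to a 77 percent mean backed by ten imaginary games) are made up purely for illustration; they are not the priors fit from the data in this post. The function names are also mine.

```python
def update_beta(alpha, beta, wins, losses):
    """Conjugate update: a Beta(alpha, beta) prior plus observed wins and
    losses gives a Beta(alpha + wins, beta + losses) posterior."""
    return alpha + wins, beta + losses


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# HYPOTHETICAL prior, for illustration only: mean 0.77, "worth" 10 games.
prior_alpha, prior_beta = 7.7, 2.3

posterior_alpha, posterior_beta = update_beta(prior_alpha, prior_beta, wins=16, losses=0)

print(beta_mean(prior_alpha, prior_beta))          # prior mean win rate
print(beta_mean(posterior_alpha, posterior_beta))  # posterior mean win rate
```

One way to read alpha and beta is as pseudo-wins and pseudo-losses: the larger their sum, the more real games it takes to move the posterior away from the prior.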
For the purposes of an example, let's say I knew nothing about sports and assumed any winning percentage is equally likely (I know, I could do better than that, but hang with me). This would mean I think winning 0 percent of games is as likely as winning 50 percent, which is as likely as winning 100 percent. This is the straight-line prior in the image below. After seeing them go 16-0, there is essentially no chance they are worse than a 70 percent win team, and the probability skyrockets that they are a 100 percent win team.
Looking at the cumulative distribution: if I had this prior, I would believe there is only a 20 percent chance the 16-0 Warriors are anything less than a 90 percent win team. I would believe there is an 80 percent chance they win between 90 and 100 percent of their games. Because this is a "weak" prior with no information, my observations this season count for a lot. My posterior is concentrated around the Warriors being a very, very good team.
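The flat-prior case is simple enough to check by hand. A flat prior is Beta(1, 1), so after 16 wins and 0 losses the posterior is Beta(17, 1), whose cumulative distribution is just x raised to the 17th power. A minimal sketch (the function name is mine):

```python
def flat_prior_posterior_cdf(x, wins):
    """P(true win rate <= x) after `wins` wins and no losses, starting from
    a flat Beta(1, 1) prior. The posterior is Beta(wins + 1, 1), and the CDF
    of Beta(a, 1) is x ** a."""
    return x ** (wins + 1)


below_70 = flat_prior_posterior_cdf(0.70, wins=16)
below_90 = flat_prior_posterior_cdf(0.90, wins=16)

print(f"P(worse than a 70 percent team) = {below_70:.4f}")
print(f"P(worse than a 90 percent team) = {below_90:.4f}")
```

The second number comes out a little under 0.17, which is roughly in line with the 20 percent I read off the chart.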
But this is pretty dumb; it doesn't take into account the fact that sports don't work this way. Obviously, being a 50 percent win team is more likely than being a 100 percent win team. More importantly, it doesn't take into account that I have tons of awesome NBA data (that I spent my Sunday grabbing), and can actually look at how teams have performed in the past to get my priors. As a second example, I took the winning percentage of all NBA teams since 1967. In the background I show a histogram of win percentages. The gray curve is the beta distribution that I fit to this data. The posterior shows a pretty dramatic shift. 16-0 suggests they are much better than average (good thing I did tons of work to figure this out).
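One simple way to fit a beta distribution to a set of win percentages is the method of moments: match the sample mean and variance. This is only a sketch of that technique, not necessarily the fitting procedure behind the gray curve, and the win percentages below are invented placeholders rather than the scraped data.

```python
import statistics


def fit_beta_by_moments(values):
    """Method-of-moments fit of Beta(alpha, beta) to values in (0, 1).
    Requires the sample variance to be smaller than mean * (1 - mean)."""
    mean = statistics.mean(values)
    var = statistics.variance(values)  # sample variance
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("variance out of range for a beta distribution")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


# HYPOTHETICAL win percentages, for illustration only (not real NBA data).
example_win_pcts = [0.30, 0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.70]

alpha, beta = fit_beta_by_moments(example_win_pcts)
print(alpha, beta)
```

A tighter spread of win percentages produces larger alpha and beta, which is a stronger prior that takes more games to move.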
Looking at the CDF, I see there is a slightly less than 60 percent chance they are at least an 80 percent win team. This is much better than the average team, which has a pretty negligible chance of being an 80 percent win team. Of course, since I am basing this on much more information than I previously had, I don't think the Warriors are as good as I naively did before.
Finally, let's take a look at my actual prior: the Warriors were going to be the best team in the league. If I fit my prior to the distribution of best teams from each season, this is what the graphic looks like. Notice that the posterior is shifted only slightly to the right of the prior. I guess this means that going 16-0 wasn't actually that strong of a signal. This actually makes sense, as the best team would probably start 12-4, 13-3 or 14-2. The Warriors really only have between 2 and 4 "extra" wins over the average best team. But if I had thought they were an average team, I would have seen 6 to 10 extra wins, a much stronger signal.
Looking at the CDF, I see that there is a 60 percent chance they win at least 80 percent of their games, a slight improvement over my prior. This is also only a slight improvement over what I got when I assumed they were an average team.
Ok, now that I have done all the statistics, what do I actually know about the Warriors?
There is a lot of talk about the Warriors winning 70 games, and I wanted to know how likely that is. So let's take the 16 wins in the bank, and the posterior on their win percentage (assuming I thought they were the best team in the league but not an abnormal best team). In this table, I show the probability that the Warriors win at least the number of games in the wins column. So there is a 98 percent chance they win at least 63 games, a 77 percent chance they match last year's record of 67 wins, and a 36 percent chance they make it to 70 wins. Not bad, but it's kind of surprising how much less likely those extra three wins are!
wins | probability
---|---
63 | 0.98
64 | 0.96
65 | 0.92
66 | 0.86
67 | 0.77
68 | 0.65
69 | 0.51
70 | 0.36
71 | 0.22
72 | 0.11
73 | 0.05
74 | 0.02
75 | 0.00
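A table like this can be produced by treating the remaining 66 games as beta-binomial: average the binomial win count over the posterior on the win percentage. The sketch below shows that calculation under stated assumptions. The posterior parameters are placeholders I made up, so it will not reproduce the numbers above, and the function names are mine.

```python
import math


def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """P(exactly k wins in n games) when the win rate is Beta(a, b)."""
    log_p = math.log(math.comb(n, k)) + log_beta_fn(k + a, n - k + b) - log_beta_fn(a, b)
    return math.exp(log_p)


def prob_at_least(total_wins, wins_banked, games_left, a, b):
    """P(season total >= total_wins) given wins already banked."""
    needed = max(total_wins - wins_banked, 0)
    return sum(beta_binomial_pmf(k, games_left, a, b)
               for k in range(needed, games_left + 1))


# HYPOTHETICAL posterior parameters, for illustration only.
POST_ALPHA, POST_BETA = 40.0, 10.0

for target in range(63, 76):
    p = prob_at_least(target, wins_banked=16, games_left=66, a=POST_ALPHA, b=POST_BETA)
    print(target, round(p, 2))
```

Because each additional win removes one more outcome from the sum, the probabilities can only fall as the target rises, which is the same pattern as in the table.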
I also have some interactive versions of these graphs in JavaScript, but Blogger isn't playing nicely with them. Please let me know if you have ideas for how to host them, for free or cheap.
github all the things!
Good stuff Evan. What kind of data did you scrape besides win-loss? It would be interesting to see the win percentage for teams when shooting above a certain percentage from behind the arc. Since there are unlikely to be many comparable teams in terms of shooting efficiency, perhaps you could use the win-loss record of all teams that have comparable statistics to Golden State as your baseline.