Monday, November 16, 2015

Come Out and Play

The Learning Problem

The Golden State Warriors have started the NBA season 11-0. If the team keeps up this pace, they will win all 82 games, sweep every round of the playoffs, and never ever lose again. Trust me, I'm a data scientist.

Something sounds fishy here, right? Obviously, I don't expect the team to win every single game just because they haven't lost one yet. Why? For one thing, no team has ever gone 82-0, so I consider that pretty unlikely. For another, the Warriors were a great team (some would say historically great) last year, and they "only" won 81.7 percent of their games. So can I really expect them to be that much better?

On the other hand, seeing how they have performed so far, I must have learned something about how amazingly awesome they are. The question is: how do I balance the information from the first few games with whatever else I know (or at least believe) about the team?

Using Bayes' Theorem

This is a classic application of Bayes' Theorem. Bayes' Theorem is an information-processing rule: it generates an updated probability (called the posterior), based on some set of earlier beliefs (the prior) and some new information. So if I can state my prior beliefs (based on what the Warriors did last year, historical examples, and what Zach Lowe has to say) and then look at some basic basketball history, I should be able to calculate my updated beliefs.

My posterior probability would be written like this:
   P(end of season wins ≥ X | wins in first N games ≥ Y)

And it would be calculated like this:
   P(wins in first N games ≥ Y | end of season wins ≥ X) * P(end of season wins ≥ X) / Evidence

Where:
   Evidence = P(wins in first N games ≥ Y | end of season wins ≥ X) * P(end of season wins ≥ X) +
              P(wins in first N games ≥ Y | end of season wins < X) * P(end of season wins < X)

P(wins in first N games ≥ Y | end of season wins ≥ X) and P(wins in first N games ≥ Y | end of season wins < X) are my conditional probabilities, and they can be calculated by looking at the historical record. P(end of season wins ≥ X) and P(end of season wins < X) are my prior beliefs.
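As a quick sketch, the update rule is easy to code directly. Here is a minimal pure-Python version (the function and variable names are my own, not taken from any particular library or from the code behind this post):

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Bayes' Theorem for a binary hypothesis H.

    prior              -- P(H), e.g. P(end of season wins >= X)
    p_data_given_h     -- P(data | H), e.g. P(started Y wins in N | wins >= X)
    p_data_given_not_h -- P(data | not H)
    Returns P(H | data).
    """
    evidence = (p_data_given_h * prior
                + p_data_given_not_h * (1 - prior))
    return p_data_given_h * prior / evidence


# Toy numbers: a 10% prior, where the data is 5x as likely under H as under not-H
print(posterior(0.10, 0.50, 0.10))  # -> 0.357...
```

The evidence term in the denominator is exactly the two-branch sum written out above.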

A Simple Tool

I took Bayes' Theorem, found some data, and wrote some code to do exactly this! The first thing I needed was the NBA historical record. Sports Database has an API where I can access all the game scores from 1996 until now, so I grabbed them all (removing the two lockout seasons). That gave me a total of 558 team-seasons to work with. I sorted the data and calculated each team's record after each game. It's not perfect: I'd like a longer historical record, and I don't love that I ignored the lockout seasons. But for a first pass, it will do.

Using this data, I can calculate P(wins in first N games ≥ Y | end of season wins ≥ X). For example, in my data set only 4 teams ever won at least 67 games: the 1996 Bulls, 1999 Lakers, 2006 Mavericks, and 2014 Warriors. Only one of those teams, the 1996 Bulls, started by winning 11 of their first 11 games. So P(wins in first 11 games ≥ 11 | end of season wins ≥ 67) = 25%.
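That conditional probability is just a count. A tiny sketch, using only the four teams named above (per the post, only the 1996 Bulls among them began 11-0):

```python
# The four 67+ win seasons in the data set, mapped to whether each began 11-0
started_11_0 = {
    "1996 Bulls": True,
    "1999 Lakers": False,
    "2006 Mavericks": False,
    "2014 Warriors": False,
}

# P(won first 11 games | end of season wins >= 67)
p_start_given_67 = sum(started_11_0.values()) / len(started_11_0)
print(p_start_given_67)  # -> 0.25
```

With the full 558 team-seasons, the same count over the sub-67-win teams gives the other conditional probability.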

So where does P(end of season wins ≥ X) come from? Well, first I just need to choose a value of X that I am interested in. Let's go with 67 wins, the same number they won last year. This probability is going to come from a binomial distribution, which gives the number of wins for an underlying probability of winning each game.

For example, I could assume the Warriors are simply an average team, with a 50% chance of winning each game. If that were true, there is only a 0.00000263% chance they would win 67 (or 81.7%) of their games.

For another example, it would also be reasonable to believe that their true talent is more like what we saw last year. In this case, they have an 81.7% chance of winning any given game. Then there is a 57% chance the Warriors actually win 67 games. Why isn't it 100%? Because a small sample won't always match the expected value; just because a coin has a 50% chance of landing on heads doesn't mean that if you flip it four times you will get exactly two heads and two tails.
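Both of those tail probabilities are binomial calculations. A minimal sketch using only the standard library (`math.comb`, Python 3.8+); the function name is mine:

```python
from math import comb

def p_at_least(n_games, k_wins, p_win):
    """P(total wins >= k_wins) over n_games, each won independently
    with probability p_win (the binomial upper tail)."""
    return sum(comb(n_games, k) * p_win**k * (1 - p_win)**(n_games - k)
               for k in range(k_wins, n_games + 1))


# An average team (p = 0.5) almost never wins 67 of 82 games...
print(p_at_least(82, 67, 0.5))    # a number on the order of 1e-9
# ...while a team as good as last year's Warriors (p = 0.817)
# does so a bit more than half the time
print(p_at_least(82, 67, 0.817))
```

Summing the exact binomial terms is fine here since n is small; for large n you would reach for a library survival function instead.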

So with all this information, if I choose a threshold (X) that I am interested in, an initial record (N, Y), and my prior beliefs about their probability of winning a random game, I can calculate my posterior.

The Results

Let's play this out a little. Suppose I think the Warriors are simply an average team this year, with a 50% chance of winning each game. Like I said before, this would mean they have a 0.00000263% chance of winning 67 games: pretty small. Now, if I saw them start off 11-0, I should update my beliefs to think they have a 0.0000869% chance of winning 67 games. Still pretty darn small, but 33 times more likely than before. My prior belief is still pretty strong, but the new information carries a lot of weight.

If instead I thought they were truly a team with an 81.7% chance of winning each game, they would have a 56.7% chance of winning 67 games. If I saw them go 11-0, I should update my beliefs to think they have a 97.7% chance of winning 67, much more certain.
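Putting the pieces together for this second case: the prior comes from the binomial tail, the 25% likelihood comes from the historical count, and the other conditional probability has to come from the data too. The post doesn't report that last number, so the 0.008 below is an assumed, illustrative value, not a figure from the analysis:

```python
from math import comb

def p_at_least(n, k, p):
    """Binomial upper tail: P(wins >= k) in n games at win probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

prior = p_at_least(82, 67, 0.817)  # P(>= 67 wins), a bit over one half
p_start_given_67 = 0.25            # from the historical record (1 of 4 teams)
p_start_given_less = 0.008         # ASSUMED for illustration; not from the post

evidence = (p_start_given_67 * prior
            + p_start_given_less * (1 - prior))
post = p_start_given_67 * prior / evidence
print(post)  # in the high-0.9 range
```

Starting 11-0 is so much more likely for a 67-win-caliber team than for anyone else that the posterior jumps well above the prior.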

What's Next?

There is a ton more to say coming from this little analysis. Why restrict ourselves to looking at just an 11-0 record, or focusing on 67 games? Can we explore across a range of priors? I plan on doing all this, and making some sweet graphics to help. I actually hope to build some interactive graphics to let users explore anything they want. And I may work on getting a larger data set. Stay tuned!

