In the last week or so, I have encountered a lot of discussion about the failure of data journalists (mostly the good folks at fivethirtyeight.com) to predict Trump's nomination as the Republican candidate. In fact, that's understating it a little: they were quite confident that Trump would not be nominated. Famously, Nate Silver put his chances of winning at around 2 percent. In a recent podcast and 538 article, Silver did an interesting post-mortem on the analysis. In part, he critiques his own methods, and in part he chastises himself for issuing a subjective prediction that did not come from a computational model; he says that in this particular instance, he acted like a pundit. He was too focused on his own priors and underestimated the uncertainty due to the small sample of "Trump-like" candidates. At the same time, he does defend his use of empirical approaches.
Unsurprisingly, I am a big fan of data journalism and Silver. I would happily defend Silver's work in comparison to the political pundits. But rather than debate the virtues or specific failures of this work, I would argue that the correctness of this particular prediction doesn't matter much. Comparing it to his previous "track record" is like comparing apples and oranges.
I tend to think the "right" or "wrong" framing that surrounds many discussions of Nate Silver's work misses the point entirely. He is frequently credited with correctly projecting all fifty states in the 2012 election. That credit refers to his final projections, which, as I recall, came within a week (maybe less) of the election. Even a relatively naive model (pick any reasonable poll, or just reuse the 2008 electoral map) would have gotten 40, 45, or 47 out of 50 right; only a small number of states were actually in play.
The value that Silver's blog provides is the constant updating of expectations as new information becomes available. For example, President Obama had a pretty bad first debate, and I remember being quite worried about the election (I was pulling for the President). A bad debate performance was obviously worse than a good one, but I had no empirical sense of whether it would have virtually no impact or cause a major swing. In the days after the debate, as new polls came out, Silver's model helped show me how much the race had actually changed.
With this in mind, it's good that Nate's final predictions were spot-on. That lends credibility to the early-October projections I was watching so carefully, but it doesn't fully validate them. His model could be accurate when projecting just a few days out and still have real flaws when projecting a few months out. Seeing that he did a good job with the days-out projections makes me more inclined to believe the months-out projections.
I should also add that while Silver's model is certainly impressive, it is within the realm of things that many people I know could build. His true talent is writing about it and putting a narrative to the election through the use of a model. In particular, Silver's writing, aided by the model, helps me find the signal in the election through all the noise (to steal his own book title). So I was never that impressed when he went 50 for 50, but I am also not that upset that he was (potentially) wrong this time. Instead, I find his discussion and process interesting, regardless of the outcome.
Getting back to the Trump prediction, some people have argued (correctly) that because Silver described his projection as a probability, we can't really say he was wrong. Maybe if we relived this primary campaign 49 more times, Trump would not have won any of them, and we just ended up in the unlucky 2 percent of universes where he won the nomination. But it's also possible that Silver underestimated Trump's chances, hence the controversy.
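To make the "relive the campaign" thought experiment concrete, here is a minimal Python sketch, assuming independent reruns of the same race with a fixed win probability. The function names are hypothetical and this is not anything 538 published; it simply simulates many reruns and computes the exact chance that a 2 percent candidate wins at least once in 50 reruns.

```python
import random


def simulate_reruns(p_win, n_runs, seed=0):
    """Rerun a hypothetical race n_runs times (independently, same p_win)
    and return how many of those reruns the candidate wins."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return sum(rng.random() < p_win for _ in range(n_runs))


def prob_at_least_one_win(p_win, n_runs):
    """Exact probability of at least one win across n_runs independent reruns."""
    return 1 - (1 - p_win) ** n_runs


# A 2 percent forecast implies roughly 2 wins per 100 reruns, not zero.
n = 100_000
wins = simulate_reruns(0.02, n)
print(f"simulated win rate: {wins / n:.4f}")

# Chance that a 2 percent candidate wins in at least one of 50 reruns.
print(f"P(at least one win in 50 reruns): {prob_at_least_one_win(0.02, 50):.3f}")
```

Under these assumptions the simulated rate lands near 0.02, and the at-least-one-in-50 probability is roughly 64%, so a universe where the 2 percent outcome happens is not exotic across many reruns. The catch is that we only get to observe one campaign, which is why a single outcome can't settle whether 2 percent was too low.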
In Silver's discussion, he states that his biggest regret, as the website's editor, is not making it clear that this 2 percent was not the result of a computational model but a subjective probability based on his own knowledge. As he describes it, that prediction should not have been taken as seriously as his modeling results.
I am very sympathetic to that argument, but I believe it is somewhat incomplete. Silver's description reads as if he is discussing two entirely separate processes: one is feeding a bunch of data into a computer, and the other is generating a subjective probability in his own head. But if you read the article with the 2 percent prediction, Silver is beginning to sketch out a mental model for how to think about this election. That is frequently the first step in building any computational model. Even when algorithms are making the predictions, humans make many assumptions and decisions along the way (some explicit, like fitting data to a particular distribution; some implicit, like what data to feed the computer in the first place). These decisions require some domain expertise, some mental models, and some empirical exploration of the available data before an algorithm is formalized. As these explorations are validated, they can become components of the computational prediction algorithm.
I guess what Silver is saying is that his 2 percent prediction back in August probably didn't meet his standard of quality; this kind of early exploration shouldn't clear the bar for the official predictions 538 prides itself on. Silver is free to set whatever quality standards he wants for his reporting, and I appreciate that he holds himself to a high standard. But I would hate for him to think that any computational model automatically clears that bar (which I don't think he does), and I would also hate for him to discount expert judgement.
If it were me, I would not shy away from the fact that even the computer-generated predictions required mental models and domain expertise (validated and tuned with data). The line between a good prediction and a bad one shouldn't be whether an algorithm produced it; it should be whether it rests on empirical research using the best available data, with awareness of that data's shortcomings. In data-rich environments, that may require a computational model. In situations without much relevant data, a mental model based on domain expertise seems not only acceptable but the only reasonable alternative.
Something I hear a lot about these days is the difference between human judgement and fully automated algorithms. For example, in the recent news about a liberal bias in Facebook's trending topics model, much of the discussion I heard asked whether human curators could insert their biases or whether everything was controlled by the algorithm. In the popular mind, the computer algorithm is a neutral party. In reality, a poor model can reflect the biases and oversights of its creator. There are obviously times when the programmed model is better, because it can be designed to overcome biases that humans frequently have, but only if it is actually designed to do that.
In fact, in Silver's ex-post evaluation, he proposes a new Bayesian model of "Trump-like" candidates. This new model would have given Trump a 10 to 15 percent chance of winning. I like the data exploration, but I don't believe it is much better than what he had in August. It suffers from methodological issues*, and, more importantly, I think it could really use some domain expertise. It's pretty reasonable to say that in certain ways Trump was more of an outlier than many of the candidates he included. Newt Gingrich was included as a Trump-like candidate, but Gingrich had been Speaker of the House! I think it would be perfectly reasonable to take this as a baseline model and then apply some subjective downward adjustments based on the ways we know Trump differs from the comparison set.
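Silver's exact specification isn't reproduced here, so the following Python is only a hedged sketch of the general recipe this paragraph suggests: a reference-class estimate under a Beta prior, followed by a subjective downward adjustment to the odds. The function names, the counts (1 nominee among 10 comparable candidates), and the halving factor are all hypothetical, chosen only to show the mechanics.

```python
from fractions import Fraction


def reference_class_estimate(wins, n, prior_a=1, prior_b=1):
    """Posterior mean of a win rate under a Beta(prior_a, prior_b) prior
    after `wins` nominations among `n` comparable candidates.
    Beta(1, 1) is the uniform prior."""
    return Fraction(wins + prior_a, n + prior_a + prior_b)


def adjust_odds(p, factor):
    """Scale the odds p / (1 - p) by a subjective factor and convert back
    to a probability. factor < 1 is a downward adjustment; factor == 1
    leaves p unchanged."""
    odds = p / (1 - p) * factor
    return odds / (1 + odds)


# Hypothetical reference class: 1 nominee among 10 "Trump-like" candidates.
baseline = reference_class_estimate(wins=1, n=10)  # (1 + 1) / (10 + 2) = 1/6
# Hypothetical judgement call: halve the odds because the candidate is
# more of an outlier than the comparison set.
adjusted = adjust_odds(baseline, Fraction(1, 2))   # odds 1/5 -> 1/10, i.e. 1/11
print(float(baseline), float(adjusted))
```

The default `prior_a=prior_b=1` is the uniform prior that the footnote takes issue with; a more informative prior would simply change those two parameters. Expressing the adjustment as an odds multiplier keeps the result strictly between 0 and 1 for any positive factor.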
Finally, I would say Silver is being too harsh on himself, because he did have a process in place to update his subjective model as more reliable data became available. Though Silver gave Trump just a 2 percent chance over the summer, he went on to build a more formal prediction model using polling data. He also used other approaches, such as a panel of experts describing possible paths to the nomination. And sure enough, as more information became available and the time horizon shortened, 538's predictions started favoring Trump. I think it would be fair to characterize their process as Bayesian learning: in August, Silver had a prior that Trump had a 2 percent chance of winning, and as more polls rolled out, he updated it. I don't think it's a flaw that Silver made his prior known (based on some reasonable, if potentially imperfect, information) and then updated accordingly. In fact, I think it's a virtue.
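The Bayesian learning process described here can be sketched in odds form, where each new piece of evidence multiplies the current odds by a likelihood ratio. This is a minimal Python illustration assuming the pieces of evidence are conditionally independent; the function name is hypothetical, and the likelihood ratios are made up and do not come from 538's model or any real polls.

```python
from fractions import Fraction


def sequential_update(prior, likelihood_ratios):
    """Bayesian updating in odds form: posterior odds = prior odds * likelihood ratio.

    Each likelihood ratio says how much more likely a new piece of evidence
    (e.g. a poll) is if the candidate goes on to win than if he does not.
    Assumes the pieces of evidence are conditionally independent.
    Returns the probability after each update."""
    odds = prior / (1 - prior)
    history = []
    for lr in likelihood_ratios:
        odds *= lr
        history.append(odds / (1 + odds))
    return history


# Start from a 2 percent prior; the likelihood ratios are purely hypothetical.
for p in sequential_update(Fraction(2, 100), [3, 2, 4]):
    print(f"{float(p):.3f}")
```

With those hypothetical ratios, the probability climbs from 2 percent to about 33 percent after three updates. The prior matters early on, but it is steadily outweighed as evidence accumulates, which is the point of stating a prior openly and then updating.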
*He puts a uniform prior on Trump's probability of winning. This says that, without knowing anything else, a 99% chance of him winning was just as likely as a 1% chance. But there were 17 candidates, and if each candidate had a uniform prior, you'd expect 8.5 candidates to have a better than 50% chance of winning, even though only one candidate can win, so at most one can have better than a 50% chance.
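The footnote's arithmetic can be checked with a short Python sketch. The first function follows the footnote's setup: independent uniform priors, where each candidate has a 1/2 chance of exceeding 50%, so linearity of expectation gives 17 * 1/2 = 8.5. For contrast, the second function swaps in a joint flat Dirichlet prior over all 17 candidates (my assumption, not something from Silver's analysis), which forces the win probabilities to sum to 1. The function names are hypothetical, and exact fractions avoid rounding noise.

```python
from fractions import Fraction

N_CANDIDATES = 17  # size of the Republican field, per the footnote


def expected_above_half_independent_uniform(n):
    """Each candidate's win probability ~ Uniform(0, 1), independently.
    P(p > 1/2) = 1/2, so by linearity of expectation the expected
    number of candidates above 50% is n / 2."""
    return n * Fraction(1, 2)


def expected_above_half_flat_dirichlet(n):
    """Assumption: one joint flat Dirichlet(1, ..., 1) prior over all n
    candidates, so the win probabilities always sum to 1. Each marginal
    is Beta(1, n - 1), whose survival function gives
    P(p > 1/2) = (1/2) ** (n - 1)."""
    return n * Fraction(1, 2) ** (n - 1)


print(float(expected_above_half_independent_uniform(N_CANDIDATES)))  # 8.5
print(float(expected_above_half_flat_dirichlet(N_CANDIDATES)))
```

Under the Dirichlet version, the expected number of candidates above 50% is 17/65536, or about 0.0003, and each candidate's prior mean is 1/17 rather than 1/2.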