Monday, February 20, 2017

Bayes Area

tl;dr: Bayesian Basketball Dashboard here.

One of the themes of my career to this point, from my doctoral research to being a data scientist in industry to writing this very blog, is the interaction between substantive expertise and quantitative analysis. In some disciplines, such as scientific research, these two domains are inextricably linked; a scientist with domain expertise proposes a model or hypothesis for how some phenomenon works, and then uses data to confirm or reject the hypothesis. In other disciplines, there has been a tension between the two. For example, Nate Silver describes a conflict between traditional political pundits and his form of data journalism.

If you know me, it's no surprise that I care deeply about data. I even feel silly writing that sentence. It seems obvious. Ever since I wrote a senior thesis in college where I analyzed auction prices for sulfur dioxide permits, I have loved getting my hands on data and learning from it. But it also seems like an empty sentence; data is everywhere and people use it in countless ways. To say that you care about data in 2017 feels like saying a fish cares about water.

What I think others would find a little surprising is my willingness to overlook or go beyond data and trust human expertise. Just because I have numbers stored somewhere doesn't mean I have evidence. Obviously, data can be misleading or biased in some way. But what I believe is that in the absence of good data, people can be (in the right contexts) very good at integrating various pieces of qualitative and quantitative information and forming judgements.

Sunday, July 31, 2016

Research Matters

One of my first blog posts was about the fantastic book Between the World and Me, by Ta-Nehisi Coates. At that time, I said I was looking forward to writing about some academic research on racism and the use of force by police officers. As with many things on the blog, it took a while. In this case, it wasn't for lack of trying. Over the last six months, I found myself returning over and over again to Google Scholar, but was unable to find any compelling research in this area.

Then, the exact week that officer-involved shootings became a major news story again, with two high-profile incidents, rallies across the country, and then a shooting against police officers, a relatively high-profile piece of economic research came out. A working paper, An Empirical Analysis of Racial Differences in Police Use of Force by Roland G. Fryer, Jr., was posted on the website of the National Bureau of Economic Research (NBER). The paper examines whether African Americans and other minority groups experience disproportionate amounts of force after being stopped or encountered by the police.

Sunday, July 24, 2016

Hadoop... There It Is (Part 2)

Well, at long last, I have completed my Hadoop Raspberry Cluster. It took a couple of months to dive back into this project. I have my own personal cloud, running technology similar to what powers some of the world's most important tech companies. However, my cloud is pretty lame. It is less powerful than the MacBook Air that I am currently writing this post on. But at least it's complete, and it's time to write about it!

Saturday, May 21, 2016

Analyze That: Data Journalism and Trump

In the last week or so, I have encountered a lot of discussion about the failure of data journalists (mostly the good folks at 538) to predict Trump's nomination to the Republican ticket. In fact, that's understating it a little bit; they were quite confident that Trump would not be nominated. Famously, Nate Silver put his chances of winning around 2 percent. In a recent podcast and 538 article, Nate Silver did an interesting post-mortem on the analysis. In part, he critiques his own methods, and in part he chastises himself for issuing a subjective prediction that did not come from a computational model. He states that in this particular instance, he acted like a pundit. He was too focused on his own priors and underestimated the uncertainty due to a small sample size of "Trump-like" candidates. At the same time, he does defend his use of empirical approaches.

Sunday, May 15, 2016

Analyze That

One of the things I often enjoy doing with my friends is thinking through some political, policy, economic, or business problem. Sometimes this is an issue in the news; sometimes it's something that one of us recently read about or heard about on a podcast. Other times, it's some random topic that we happened to stumble onto over the course of a conversation. Either way, we generally just have a good time breaking such a problem down. We often jokingly refer to this as "consulting the shit" out of a problem.

Tuesday, May 10, 2016


A few weeks ago I posted about the difference between machine learning and econometrics. Though I talked a lot about the applications of the two techniques, I tried to avoid getting very detailed about any of the algorithms involved. Recently, I've also been doing a deep dive on one of these machine-learning algorithms: Neural Networks.

About two years ago, I started hearing a lot about "deep learning" and a powerful algorithm called a neural network. These things seem to be everywhere, from Siri to facial recognition. I must admit, somewhat embarrassingly, it took me quite some time to figure out what exactly a neural network was.

Everything that I read when trying to understand the neural network suffered from one of two problems. Some pieces just weren't that technical, and described the analogy of this algorithm to a human brain. They talked about things called "hidden layers" without telling me what was actually going on in them. The other type of post had a lot of math very quickly. It's not that I couldn't understand the math, but I wanted the high-level technical summary, knowing I would work through the math later.

But two things became very clear. First, these are the "blackest box" of the machine learning algorithms we have. Everything about inference and decision-making in the last post does not apply when using NNs. Second, they are really, really powerful. They are extraordinarily good at addressing some of the toughest data science problems. Given their recent success, they aren't going anywhere.

So it was time to learn!