Sunday, March 13, 2016

Machine Inference

Hal Varian, the chief economist at Google, put together an awesome presentation about the differences and similarities between machine learning and econometrics. Varian clearly knows his stuff. Working at Google, he is surrounded by some of the foremost experts on machine learning. And he's no slouch when it comes to economics; he is the author of many students' favorite microeconomics textbook.

After getting a Ph.D. in policy analysis, where I studied a fair amount of econometrics, I started working as a data scientist using machine learning. Though I am not the expert on this issue that Varian is, I too have found myself talking about it a fair amount. So, at the risk of repeating much of the same subject matter, I decided I would write my take on it. My thoughts are a little less technical than Varian's, and are largely aligned with his presentation. By working through a real-ish example, I hope I can describe the risks of using machine learning to address a topic that really requires econometrics.


Machine Learning and Econometrics

Machine learning is a field of study about algorithms that allow a computer to be trained by ingesting data rather than being given explicit instructions. It's closely related to statistics. In fact, I took an advanced statistics course in graduate school, and it took me three years to realize that what I had learned is now usually called machine learning.

Many machine-learning algorithms (and most of those that I will focus on here) fall under supervised learning, where the computer is trying to classify or predict something. These algorithms rely on defined target variables that are predicted based on other pieces of data, called features. Here are a couple of examples:
  • A spam filter, where a computer reads the text of an email and classifies whether it's an email most people would consider spam.
  • The Facebook news feed, which predicts which posts from a user's friends will be of most interest to them.
  • OCR, where a computer sees pixels of various intensities and classifies them as characters.
Data scientists spend a fair amount of time collecting, formatting, moving, and cleaning data (called wrangling), and testing and tuning different algorithms to make predictions or classifications.
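To make the supervised-learning setup concrete, here is a minimal sketch of the spam-filter example: a tiny logistic regression trained by gradient descent on two invented features (count of exclamation marks, whether the email contains the word "free"). All data and features are made up for illustration.

```python
import numpy as np

# Toy training set: each row is an email described by two features
# (count of "!" characters, contains the word "free"), and the target
# marks whether a human labeled it spam (1) or not (0).
X = np.array([[5, 1], [7, 1], [6, 0], [0, 0], [1, 0], [0, 1]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Fit a logistic regression by plain gradient descent on the log-loss.
w = np.zeros(2)
b = 0.0
for _ in range(5000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted spam probability
    w -= 0.1 * X.T @ (p - y) / len(y)    # gradient step on the weights
    b -= 0.1 * np.mean(p - y)            # gradient step on the intercept

def predict(features):
    """Return the model's spam probability for one email's features."""
    return 1 / (1 + np.exp(-(np.array(features) @ w + b)))

print(predict([6, 1]))  # many "!" plus "free": high spam probability
print(predict([0, 0]))  # clean email: low spam probability
```

The learned mapping from features to target is the whole point; the computer was never given an explicit rule for what makes an email spam.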

Econometrics is the field of study in economics that uses statistics to draw economic inferences. These inferences are generally about isolating the impact that one independent variable has on a dependent variable. Generally, the economist using these tools is interested in understanding the causal impact.

Here are a couple of examples of what a good econometrics study may try to identify:
  • The impact of a college education on an individual's future earnings
  • The impact of increasing temperature on GDP
  • The impact of a change in prices on aggregate sales
In economics and many social sciences, this is really hard because the economist generally can't just run a randomized experiment to get an answer. I had a professor in grad school who once joked that because he had twins, he could finally answer the question of college's impact on earnings: he would just flip a coin, and one twin would get to go to college while the other wouldn't. Instead, economists usually rely on natural experiments, or something approximating them, and use complex statistics to measure an effect size and ensure it is statistically significant.
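Here is a sketch of what "measuring an effect size and checking significance" looks like for the schooling-and-earnings example. The data is simulated, and (unlike the real world) schooling is assigned randomly, so plain OLS recovers the true effect; all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: years of schooling and log earnings. We build in a
# true "return to schooling" of 0.08 and then try to recover it.
n = 500
schooling = rng.integers(10, 21, size=n).astype(float)
log_earnings = 9.0 + 0.08 * schooling + rng.normal(0, 0.5, size=n)

# Ordinary least squares: the effect size is the slope coefficient.
X = np.column_stack([np.ones(n), schooling])
beta, *_ = np.linalg.lstsq(X, log_earnings, rcond=None)

# Standard error of the slope, to judge statistical significance.
resid = log_earnings - X @ beta
sigma2 = resid @ resid / (n - 2)
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

print(f"estimated return to schooling: {beta[1]:.3f} (s.e. {se:.3f})")
```

The hard part of real econometrics is exactly what this sketch assumes away: in observational data, schooling is not randomly assigned, so the slope alone doesn't demonstrate causality.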


In summary, machine learning generally aims to create as accurate a prediction as possible, while econometrics generally aims to understand how one thing impacts another.

But these two very different aims have a lot in common. Both require data and computing power to process the data. Both rely on complex statistical algorithms (often requiring a ton of linear algebra) to do the analysis. The targets in machine learning are the dependent variables in an econometrics problem, and the features in machine learning are the independent variables. Both often employ the same algorithms. The first machine-learning algorithms most data scientists learn are linear regression and logistic regression, which are the workhorses of econometrics.

Of course, there are differences. Many of the advances in machine learning come from increasingly complex algorithms. It is often difficult to tease out exactly how one feature relates to the target, even when the data scientist can confirm that it provides a signal.

On the other hand, much of the hard work in econometrics is tuning models to make sure the estimates of statistical significance are correct, and building a credible argument that the result demonstrates causality.

When to use each method
When I attended an INFORMS* conference years ago, I saw a presentation (I can't remember who the speaker was anymore) that described three categories of uses of data: descriptive, predictive, and prescriptive. Descriptive uses rely on data and statistics to describe the world as it is today. Predictive uses rely on data and algorithms to make some guess about the future. And prescriptive uses generally try to answer the following question: "given a prediction, what should I do?"

Obviously there is something of a hierarchy here; each of the three categories builds on the previous one. It's really hard to know what's coming in the future without knowing what's happening right now. And it's really hard to give advice about what to do in the future without some projection of what the future looks like.**

In my humble opinion, the greatest advances in machine learning as of late have been in the descriptive area. Being able to classify an image or understand human speech are descriptive problems. They take something that is happening now and translate it into something else that is happening now.***

Of course, machine learning has also had successes in the predictive arena. A predictive problem is simply a descriptive problem that looks some amount of time into the future, even if that amount is very, very small. For example, if a search-engine user types in a query, which link will they be most likely to click on once the engine returns results? This is a vaguely predictive problem.

In my experience, machine learning has the greatest success when the data scientist limits what they are trying to predict. Predicting something a second from now is easier than predicting something a week from now, which is easier than predicting something a year from now. Also, the lower the resolution of the prediction, the easier it is. Predicting whether a customer will spend at least $50 at a store is easier than predicting precisely how much the customer will spend. Part of the "art" of data science is figuring out how to recast a complicated prediction problem into a simpler one.

But the prescriptive arena is a whole other problem. The easiest form of a prescriptive problem would be comparing two different predictions, with one independent variable that the decision-maker can control. But now, instead of simply searching for signal in noisy data, the data scientist needs to isolate the impact of that particular independent variable while holding everything else constant. This requires inference and the kinds of tools provided by econometrics.

Because prescriptive problems require a different toolset, I think it's common for data scientists to try to recast a prescriptive problem as a predictive one. They probably don't even realize they are doing it, but they risk getting themselves into some trouble.

An Example: Customer Churn

In the abstract, my last two paragraphs are probably pretty unclear. So I will make this concrete with the example of customer churn. Imagine a company that sells subscriptions to its service; it makes money each month when a customer pays a set fee. I'll call this imaginary company Spotdoraflix. Spotdoraflix is a pretty awesome company: it lets customers consume all types of media on their smartphones so long as they are subscribed.

Because Spotdoraflix makes money off of subscriptions, it wants to prevent customers from deciding to cancel their subscriptions, or "churn." So, it decides to hire a data scientist to predict when customers are going to churn and to say what to do about it.

The data scientist decides to build a model that can be run at the beginning of each month to predict how likely each customer is to cancel their subscription by the end of the month. So the data scientist collects a bunch of data and builds a machine-learning model.

They collect some data about each user. First, they get some attributes of the customer, such as their home city and what type of phone they use. Next, they pull some data about the customer's relationship with Spotdoraflix, like how long they have been subscribed. Next, they calculate some features about the customer's behavior in the previous month, like how many hours they used Spotdoraflix. Finally, the data scientist generates some features describing how the customer's behavior has changed over time, like the ratio of hours spent on Spotdoraflix in the last month compared to the year before.

With this data, the data scientist could go build a simple machine-learning model. They might try a Gradient Boosted Machine (GBM), an algorithm which generally performs very well on this type of problem. To get a training set, they would probably look at all customers at the beginning of the previous month, mark whether or not they churned during that month, and train the model using data from before that month. They would probably spend a few weeks building new features and tuning the model, but could probably end up with a pretty good predictive model.
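A sketch of what that model-building step might look like, using scikit-learn's GBM implementation. Every feature name and data point here is invented; I also fabricate the churn process itself (light users on Android churn more) so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)

# Hypothetical training set: one row per customer at the start of the
# previous month, with made-up features like those described above.
n = 2000
df = pd.DataFrame({
    "months_subscribed": rng.integers(1, 60, n),
    "hours_last_month": rng.exponential(20, n),
    "hours_ratio_vs_last_year": rng.normal(1.0, 0.3, n),
    "is_android": rng.integers(0, 2, n),
})
# Invented churn process: light users on Android churn more often.
churn_prob = 1 / (1 + np.exp(0.1 * df["hours_last_month"] - df["is_android"]))
df["churned"] = rng.random(n) < churn_prob

features = ["months_subscribed", "hours_last_month",
            "hours_ratio_vs_last_year", "is_android"]
model = GradientBoostingClassifier(random_state=0)
model.fit(df[features], df["churned"])

# Score customers: predicted probability each one churns this month.
df["p_churn"] = model.predict_proba(df[features])[:, 1]
print(df[["is_android", "hours_last_month", "p_churn"]].head())
```

In a real project the scoring would of course run on the current month's customers, not the training rows; this just shows the shape of the pipeline.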

Then they might hand the marketing team a list of each customer's probability of churning, telling marketing that those with the highest probabilities are the ones they should focus on. And they could tell the marketing team a little bit about which features mattered most; the GBM provides that information. Let's say they find that the type of phone and the number of hours the customer used Spotdoraflix in the last month were the two most important features. So the data scientist pokes around a little more and provides some summary statistics, such as: customers with Android phones were three times more likely to churn than those with iPhones, and people who used the product for less than one hour per month had a 25 percent chance of churning the following month.
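The "poking around" for summary statistics is usually a few group-bys. A minimal sketch with a handful of invented rows (the counts are far too small to be meaningful; they just show the mechanics):

```python
import pandas as pd

# Hypothetical labeled data: phone type, hours used last month, and
# whether the customer churned. All values are invented.
df = pd.DataFrame({
    "phone": ["android", "iphone", "android", "iphone",
              "android", "iphone", "android", "android"],
    "hours_last_month": [0.5, 12.0, 3.0, 8.0, 0.2, 30.0, 1.5, 0.8],
    "churned": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Churn rate by phone type: the kind of summary statistic handed to
# marketing. Note this is a correlation, not a cause.
print(df.groupby("phone")["churned"].mean())

# Churn rate among light users (under one hour last month).
light = df["hours_last_month"] < 1.0
print(df.loc[light, "churned"].mean())
```

These statistics describe who churns; they say nothing yet about why, which is exactly the gap the next paragraphs get at.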

This is great, except the marketing team still wouldn't really know what to do with the list.

I can think of at least two reasons why the data could show that people with Android phones are more likely to churn than people with iPhones. Maybe the Spotdoraflix app runs much better on iPhones. Or maybe iPhones are, in general, more expensive than Android phones, and having an iPhone is a signal that the customer is richer and thus less likely to cancel a subscription. If it were the first case, the marketing team would want to tell the engineers making the apps, and Spotdoraflix could invest a bunch of money in making its Android app much better. But if it's the second case, the phone itself isn't going to matter; these folks are still cash-constrained, and maybe Spotdoraflix should offer them a discount instead. The point is, the GBM only found that Android phones correlate with churn; it does not provide any insight into the cause.

So let's say the marketing team does some interviews and research, and decides to give the customers most likely to churn a price discount. Of course, the data scientist didn't include price in the algorithm, because until this point all Spotdoraflix users paid the same price. There was no variation to exploit in the model.

But Spotdoraflix doesn't know how big the discount should be. Should it be a 25 percent discount for one year, or would 10 percent for three months get the job done? Obviously the company would like to choose the cheaper option, but not if it isn't going to be enough. So Spotdoraflix should probably run a randomized experiment. This experiment is inference, and the results could be analyzed using econometric tools.

So, let's say Spotdoraflix's marketing team goes ahead and starts this experiment. There are three groups: one that is unchanged, one that gets a 25 percent discount for one year, and one that gets a 10 percent discount for three months. It's not perfect, but it begins to fill in the gaps. But there are some problems. First of all, all the great information the machine learning is supposed to provide isn't there. It's going to take a long time to get the results of the experiment. Is the discount enough to keep people from churning next month? Is the discount enough to keep them from churning the whole time they get the discount? What happens when the discount ends? It's going to take well over a year to answer all of these questions.
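Even the first-month read-out of such an experiment is a simple econometric exercise: compare churn rates across arms and check whether the differences clear statistical significance. A sketch with invented counts (a two-proportion z-test against the control group):

```python
import numpy as np

# Hypothetical first-month results of the pricing experiment.
# All counts are invented for illustration.
groups = {
    "control":     {"customers": 1000, "churned": 120},
    "25pct_1yr":   {"customers": 1000, "churned": 45},
    "10pct_3mo":   {"customers": 1000, "churned": 80},
}

control = groups["control"]
p0 = control["churned"] / control["customers"]

for name, g in groups.items():
    if name == "control":
        continue
    p1 = g["churned"] / g["customers"]
    # Standard error of the difference in churn rates
    # (two-proportion z-test, unpooled).
    se = np.sqrt(p0 * (1 - p0) / control["customers"] +
                 p1 * (1 - p1) / g["customers"])
    z = (p1 - p0) / se
    print(f"{name}: churn {p1:.1%} vs control {p0:.1%} (z = {z:.1f})")
```

Note what this does and doesn't answer: it estimates the one-month effect of each discount, but says nothing about churn after the discount ends, which is why the full experiment takes so long to read out.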

Second, the GBM is all of a sudden wrong. If the data scientist doesn't do anything, there's a good chance he predicts many of the same folks will churn the next month. After all, the price change doesn't really affect anything else in his model.

Or he can naively update the GBM by training it on the same data, but including the first month's results of the experiment. But suppose the discounts are very effective; let's say everyone who got the discount stayed on and didn't churn. Now everything will be reversed: people with Android phones will appear less likely to churn, because they all got the discount.

Including the discount itself in the model would be the best solution, but the results won't be clear until long after the experiment.

The point is, once the experiment has begun, the world has changed. The old model is invalid, but its going to take a while to get enough data to train a new model effectively.

So what?
My broader point here is that for many predictive problems, the target is endogenous: it responds to the model itself. That is, once a prediction is created, somebody acts on it, and what was previously predictive no longer is. Take search, for example: while the search engine is trying to predict which link the user would like to see, by putting a new link at the top, it creates a self-fulfilling prophecy.

In the case of churn, the marketing team is doing something to change who churns, so the old model is no longer valid. And it's going to take a fair amount of time to collect new data on who will churn next, within this new system.

A data scientist skilled at econometrics might instead try to answer "what is causing the customer to churn?" Now, this could be anywhere from harder work to completely impossible; this is the stuff Ph.D. dissertations are made of. But if the data scientist found something, then the company could think about changing that thing directly. And at the very least, the data scientist would be aware of the deficiencies of their own model once the system has changed.

I am not saying machine learning is wrong. But there are risks that frequently aren't considered when using machine learning on endogenous systems. I would love to hear anyone else's thoughts on, or experiences with, this.

* the professional society for operations research
** I have some ideas
*** Of course, this is a huge generalization. Being able to win at chess or Go is a prescriptive problem (what move should I make?). It's really hard, but it's also in a very, very constrained environment.
