Sunday, December 20, 2015

Stupid Data Tricks

A Markov-Chain Text Generator for Evanometrica

You will all have to excuse me, as it has been a little bit longer than I was expecting since my last post. First, I was traveling. Second, I found that it is a little more time-intensive to generate high-quality blog posts than I initially assumed.

Being a data scientist and all, I figure I can just get the computer to do it for me! I decided to write an algorithm that comes up with blog-post headlines for me. It's an extremely simple Markov-Chain text generator, and it leads to some entertaining results. This was fun but accomplishes nothing of value, so it was a Stupid Data Trick.

The Model

When I laid out the major topics of this blog, I figured I would write about Bay Area Sports, Data-Science, and Economics. It is only appropriate that I use my data-science skills to generate articles about Bay Area Sports and Economics. To build my text generator, I needed a vocabulary that reflected these interests. For the sports language, I pulled headlines from San Jose Mercury News sports writer Tim Kawakami, focusing on articles categorized as being about the 49ers, Giants, and Warriors. For the economics language, I pulled titles from the National Bureau of Economic Research's archive of working papers. Mixing these titles creates a series of article headlines that merge my two interests.

Now, grammar has some structure to it. If I just built a complete grab-bag of words (randomly choosing words from both sources), it is pretty likely that my headlines wouldn't even resemble English. Fortunately, there is a pretty simple algorithm that will make them at least resemble real English: a Markov-Chain model. A Markov-Chain is a simple structure for representing a random process. The process transitions from state to state, and the probability of moving to the next state depends only on the current state.

A classic example of this is the weather. In this simple model, I might assume there are two states, sunny and rainy. Each state has transition probabilities: if it is sunny today, maybe there is a 75 percent chance it will be sunny tomorrow and a 25 percent chance it will be rainy tomorrow. If it is rainy today, maybe there is a 50 percent chance it rains the next day and a 50 percent chance it is sunny. Now, if I were really building a weather model, instead of pulling these probabilities out of thin air, I would look at historical weather to estimate them.

Using these transition probabilities, I can simulate sequences of weather. I can start on a sunny day and randomly draw the weather of the next day. Once my simulation tells me the second day's weather, I can do it again for the third day. And again, and again.
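As a minimal sketch, the whole weather chain fits in a few lines of Python. The probabilities are the made-up ones from above, and the function name is just my own illustration:

```python
import random

# Made-up transition probabilities, as in the example above.
weather_transitions = {
    "sunny": {"sunny": 0.75, "rainy": 0.25},
    "rainy": {"sunny": 0.50, "rainy": 0.50},
}

def simulate_weather(start, days):
    """Walk the chain: each day's weather depends only on the day before."""
    sequence = [start]
    for _ in range(days - 1):
        followers = weather_transitions[sequence[-1]]
        next_state = random.choices(list(followers), weights=list(followers.values()))[0]
        sequence.append(next_state)
    return sequence

print(simulate_weather("sunny", 7))
# e.g. ['sunny', 'sunny', 'rainy', 'sunny', 'sunny', 'rainy', 'rainy']
```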

Ok, so maybe this isn't all that interesting when just looking at sunny and rainy days. But these tools have been used to analyze transitions in much more complex systems and learn something about their behavior.

Text can be thought of the same way. Each word is a state, and there is some probability of transitioning to the next word. Just like I would look at historical weather to create transition probabilities, I look through the vocabulary I have assembled and keep track of the words that follow each word. I store the distribution of those following words.
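Here is roughly what that bookkeeping looks like in Python. The toy titles are just placeholders for the real scraped vocabulary:

```python
from collections import defaultdict, Counter

# Toy stand-ins for the scraped Kawakami headlines and NBER titles.
titles = [
    "Warriors add Shaun Livingston",
    "The Warriors and Equilibrium Unemployment",
    "The Effects of Health Insurance Reform",
]

# For each word, count how often every other word follows it.
transitions = defaultdict(Counter)
for title in titles:
    words = title.split()
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word][next_word] += 1

# transitions["The"] is now Counter({"Warriors": 1, "Effects": 1})
```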

Once I have these distributions, I can build sentences. I provide a start word, then randomly draw the next word based on the estimated transition probabilities. Keeping track of the sequence, I build up a headline.

I also treat the end of the headline as a state, to create a natural ending. By including words from both the sports and economics vocabularies, the process will naturally transition between the two. This is because there are enough words common to both sources (such as prepositions and articles) that it is not unreasonable for a word from a sports headline to follow a word from an economics paper title.

So then I generated a whole bunch of crazy headlines, sampling from the first words of all the headlines I collected and letting those be the seeds. The results were pretty entertaining.
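Putting the pieces together, a minimal sketch of the generator might look something like this. The function names, the END sentinel, and all_titles are my own illustrative choices, not code from the actual project, and the real scraping and cleanup are left out:

```python
import random
from collections import defaultdict, Counter

END = "<END>"  # sentinel state so headlines can come to a natural stop

def build_chain(titles):
    """Build transition counts plus the pool of first words used as seeds."""
    transitions = defaultdict(Counter)
    seeds = []
    for title in titles:
        words = title.split()
        seeds.append(words[0])
        for current, following in zip(words, words[1:] + [END]):
            transitions[current][following] += 1
    return transitions, seeds

def generate_headline(transitions, seeds):
    """Seed with a sampled first word, then walk the chain until END."""
    word = random.choice(seeds)
    headline = [word]
    while True:
        followers = transitions[word]
        if not followers:
            break  # word was never seen with a follower; stop here
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        if word == END:
            break
        headline.append(word)
    return " ".join(headline)

# With the real sports + economics titles in place of the toy ones:
# transitions, seeds = build_chain(all_titles)
# print(generate_headline(transitions, seeds))
```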

Some Articles I Would Love To Write

Though my simple model is far from perfect (I will get to this later), I am very entertained by some of the suggested blog posts. Here is a list of titles that I can't help but giggle at when I read them. Maybe one day I will write one of these!
  • Money Market Miracle in The First of Bilateral US-China Trade: On Steve Kerr and more
  • Buster Posey: Evidence
  • Steve Kerr with the U.S. Data on the Post-Welfare Reform Act
  • Cascades in Online Markets Hypothesis: The 49ers and Nutrient-Specific Taxes
  • The Influence on its Bartolo Colon Cancer
  • NaVorro Bowman and Cash, Housing Boom, LeBron James, and Economic Activity
  • Steve Kerrs philosophy: Evidence from the United States
  • The Warriors evaluation, Discrimination: The Effects
  • Post-game Giants: A Study
  • Getting Cars and Justin Smith: Health Insurance
  • Jim Harbaugh on from the Historical Heights and Wage Effects of Cooperation
  • The Yuan and Survey: Evidence from a Clustered Randomized Evidence from my go-to NBA finals
  • Warriors to Delay Claiming Behavior: Right to Pay In-state Tuition at 16-2, including Joe Thornton vs. Financial Market Search over Carolina
  • Qualitative Easing: Greg Roman goes cap-less
  • Does Federally-Funded Job Creation: The Importance of a giant
  • Death and the 49ers and knowing the Cost of Dishonesty and the Business Cycles
  • The Giants-Padres tilt the Evolution of Gaussian Affine Term Impacts of Public Health Parity Mandates
  • Tim Lincecum and the Warriors thrill-ride hiring of School Ties, late line-up, Revenues, and Neighborhoods
  • The Giants properly and Income Redistribution
  • Zito and Tax
  • Warriors add Shaun Livingston and Equilibrium Unemployment
  • Not-so-Classical Measurement: The 49ers?
  • The Warriors see their fans are Specific Peer Effects of the Taylor Rules for Welfare
  • Joe Lacob and Cigarette Excise Taxation with Politics
  • Bayesian Learning in a Rose? Bruce Bochy pre-game: A Model of Consumption Fluctuations
  • Steph Curry: Does Energy Outlook for Public Policy Effectiveness
  • Government Behavior: On Melky Cabrera and Health Insurance Reform and Family Ties
  • Rent-sharing, the Giants struggle, and Zito, and flirtations
  • Andre Iguodalas health, and Objective Indicators
  • Liquidity Is it? The 49ers influence and end of Medicare Buy-In
  • Robust? Evidence from Marathon Runners
  • The 49ers over Time Varying Risk Premia
  • How Consumers and the MVP refused to Boston's Labor Market Read: Latent Class Size Distribution Channel in Historical Perspective

Too Much Nonsense

Unfortunately, these hilarious headlines are something like 10 to 20 percent of the generated titles. Too many of them are more like this:
  • Some Unsettled Problems of Employer-Provided Health Bank Assets
  • Self-Fulfilling Debt, this era, right with Sandoval talks: Lessons from Federal Tax Policy: Evidence from US? How the Magnitude and Public Safety: Crises
  • Morning 1, Institutions on the Old Keynesian Models
  • The Extensive Margin of Risk-Taking: Some?
  • A Study of Opinion '': Evidence
  • The Geography of Distance to a Credit Card Market Outcomes in the Draymond Green, 19502010
  • Raiders latest chapter: Evidence from a team they need itand what it
  • Market Responses Undercut Their Peers on the Banque de France, Peer Effects of Corporate Income, and thats almost 5 reasons Friday 1 salary, and Time


These examples reveal some common errors. First, I suspect I am over-sampling the economics titles, as many of these titles barely mention anything sports-related. I need to investigate the sample sizes to make sure they are relatively balanced. Second, I find the funniest titles are a nice mix of sports and economics, so I am going to explore some methods to enforce more mixing of the two sources of text. Third, there are some pretty bizarre grammatical errors, including inconsistent capitalization, punctuation that makes zero sense, and numbers with no meaning*. I'd like to use Python's Natural Language Toolkit a little bit to correct these errors.

Future Vision

While cleaning up some of the nonsense is necessary, there are a couple more things that I would like to explore. The Markov-Chain that I set up only references the prior word. I could also define a version that references the two prior words. This should help make the model a little more grammatically consistent, and maybe more thematically consistent as well.
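A rough sketch of what that second-order version might look like (again, the names here are just illustrative): the state becomes the pair of the two prior words, and the counts are keyed on that pair.

```python
from collections import defaultdict, Counter

END = "<END>"

def build_second_order_chain(titles):
    """Key the transition counts on the PAIR of prior words instead of one."""
    transitions = defaultdict(Counter)
    for title in titles:
        words = title.split() + [END]
        for i in range(len(words) - 2):
            state = (words[i], words[i + 1])
            transitions[state][words[i + 2]] += 1
    return transitions
```

Generation would then draw the next word from transitions[(two_words_back, previous_word)], probably falling back to the single-word counts whenever a pair was never observed.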

Second, I'd like to build this thing into a Twitter bot. This would just be a fun little excuse to play with the Twitter API. Ideally, I would set it up so that a user could tweet a word at the bot, and it would respond with a generated Evanometrica headline about that word.

This requires a few more steps. First, a user might tweet a word that is not in my sampled vocabulary, so I would need a strategy for finding similar words. Second, I built this by randomly choosing a word to start with and building forward from there. However, the headline might be better served with the seed in the middle, so I would need a strategy to build both forwards and backwards. Finally, Twitter has a character limit, so I would need a strategy to ensure these titles end (somewhat sensibly) within the limit.
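One way I might handle the middle-seed and length problems is a second chain that runs in reverse. This is only a sketch under my own assumptions (the names, the START sentinel, and the 140-character default are all mine), covering just the backwards half:

```python
import random
from collections import defaultdict, Counter

START = "<START>"  # sentinel marking the beginning of a headline

def build_reverse_chain(titles):
    """Count which words precede which, so a headline can grow leftward."""
    reverse = defaultdict(Counter)
    for title in titles:
        words = [START] + title.split()
        for previous, word in zip(words, words[1:]):
            reverse[word][previous] += 1
    return reverse

def grow_left(seed, reverse, limit=140):
    """Walk backwards from the seed until a natural start or the length limit."""
    words = [seed]
    while True:
        predecessors = reverse[words[0]]
        if not predecessors:
            break
        previous = random.choices(list(predecessors),
                                  weights=list(predecessors.values()))[0]
        if previous == START or len(" ".join([previous] + words)) > limit:
            break
        words.insert(0, previous)
    return " ".join(words)
```

The forward half would be the same generator as before, started from the seed and given whatever characters remain under the limit.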

Stay tuned for these improvements and a linked GitHub repo.

* Anyone who has read enough of my writing might say that this is a feature, not a bug, when it comes to emulating my voice.
