Sunday, January 10, 2016

Hadoop... There it is (Part 1)

Adventures in building my own personal cloud

Around Silicon Valley, people talk a lot about Moore's Law, which observes that microprocessors double in power (for a constant cost) roughly every 18 months. This law has produced something else that Silicon Valley types talk a lot about: Big Data.

That's right, Big Data did not just pop out of nowhere. As computers have become cheaper and more powerful, the cost of storing data has dropped dramatically. When computers were slow and expensive, people and companies had to make choices about what data to save. Today, it's probably more expensive to hire people to spend time thinking about what data to store than it would be to just throw it on a computer somewhere.

Actually, that last sentence is not quite correct. As powerful as modern computers are, most are not quite big enough to handle Big Data. In fact, the term Big Data refers to data that are too big to store on a single computer. Instead, the data are stored on a whole bunch of computers networked together, called a cluster. So I should have said, just throw the data on a bunch of computers somewhere.

As a data scientist, I work on this type of cluster. But I have never actually seen or touched the computers! In an effort to understand a little bit more about how they work, I decided to build my own personal Big Data computer cluster.

Why I built a cluster

A project where I play around with computer hardware is something I never thought I would do. Since my first econometrics course, sophomore year of college, I have been playing with data. For the last decade, I have been doing some form of computer programming to handle the data. But I always saw the computer programs that handle the data as a means to an end. The truth is, I am mainly interested in answering some empirical question about economics or policy. Data allows me to answer the question, and a computer is just the tool to manipulate the data toward that end. Until relatively recently, I only wanted to focus on the question and use the simplest tools I could to answer it.

What changed? As the questions got more complex and the data got bigger, I needed to start thinking about more than just the math to answer the questions. I needed to think about the most efficient way to do the necessary calculations. I realized I can't just treat a computer like a black box. If I consider the underlying design of the computer, I can use more data to answer harder questions faster.

Take the analogy of cooking a meal. If you are cooking a meal for one person, you can probably assume any kitchen will have most of what you need: a stove with four burners, an oven, a toaster. You get it. For one person, you can pretty much focus on the ingredients (the data) and the recipe (the analysis). But if you are a caterer cooking a meal for 50 people, other questions about the kitchen become important. Do you have enough burners for all the dishes, and a pot for each burner? Which recipes need the oven at 350 degrees and which need it at 450, and how should you stage which recipe goes into the oven when?

Small data is the dinner for one; most laptops, regardless of their hardware, can get the analysis done. But Big Data is the feast. Cooking the meal requires considering more than just the recipe and ingredients; it requires balancing the logistics of the tools available in the kitchen.

I guess most caterers that face this problem don't decide they need to solve it by building their own kitchen. But the beauty of Moore's Law is that it can be pretty cheap to go build your own cluster. And that provides an incredible opportunity to learn.

How I built the cluster

To build my cluster I bought the following equipment (with links below):


All in, this little experiment cost me about $210.

The Raspberry Pis are essentially cheap little computers, similar in speed and compute power to the iPhone 4S (a top-of-the-line smartphone from 3.5 years ago). Each computer chip sits on a board roughly the size of a credit card, attached to ports for ethernet cables, USB cables, and an SD card.


Assembling the cluster was pretty simple. Each Raspberry Pi gets an SD card to serve as its hard disk. Ethernet cables connect each of the Raspberry Pis to the router, allowing the computers to talk to each other. USB cables connect the Raspberry Pis to a power supply. And the wifi adapter is plugged into the USB port on one of the Pis, so the cluster can get on the internet.

The hardest part about handling the hardware was figuring out how to keep the cords from getting tangled. So, I got some Legos.


Or maybe, I got the Raspberry Pis as an excuse to play with the Legos. 

Even though I now have a "big data" cluster to play with, it's actually a pretty bad computer. The chips run much slower than my MacBook Air's. In fact, the entire cluster has less disk space and less RAM than my laptop does. So it's a big data cluster that is worse than my laptop. This cluster is not really much more than a toy for me to learn with.

So what does the cluster mean for computations?

So I have three computers, each with a 32-gigabyte hard drive, and they are all networked together. But I am not quite ready to do any calculations.

First of all, each Raspberry Pi's computer chip is actually a "quad-core," which means the chip contains four cores, each roughly a small processor in its own right. I tend to think of this as having 12 computers, each of which, roughly speaking, can do one calculation at a time. I have 12 threads.* Each Raspberry Pi has a gigabyte of RAM (the memory that is very quickly accessible to the computer). This RAM needs to hold things like the operating system, so it's not all available for my data science. But each thread can relatively quickly access whatever data I load into memory. And finally, the SD card serves as the hard disk for each of the computers, which means that if I want to do a calculation on a specific thread, the data should be sitting on the SD card attached to that particular Raspberry Pi.
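
To keep those numbers straight, here is a quick back-of-the-envelope tally in Python. The figures are just the ones described above (three quad-core Pis, each with a gigabyte of RAM and a 32-gigabyte SD card), not anything the cluster reports about itself:

# Rough tally of the cluster's resources, using the numbers above.
nodes = 3               # three Raspberry Pis
cores_per_node = 4      # the "quad-core" chip on each Pi
ram_gb_per_node = 1     # shared with the operating system
disk_gb_per_node = 32   # the SD card standing in for a hard disk

print("threads:", nodes * cores_per_node)      # 12
print("RAM (GB):", nodes * ram_gb_per_node)    # 3
print("disk (GB):", nodes * disk_gb_per_node)  # 96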

Putting all this together, for me to use big data effectively, I would need to spend a lot of effort shuffling data between the three computers. That shuffling is what ensures each thread has a calculation to do and is working efficiently.

For example, I previously built an analysis of ticket prices for San Francisco Giants games. One of the many variables I needed to calculate was the average price of a ticket, by the section of the stadium in which the seat is located. If I were using my cluster, the most efficient way to do this would be to make sure each thread is calculating a different section, and that each thread has a queue of sections to calculate, with the queues roughly equal so that the work is spread evenly. Somehow, I need to tell the computers to do this, and make sure that the data for every ticket in the same section is stored on the same SD card. Then each thread can pull in the prices and calculate the average.
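
To make the "each thread averages its own sections" idea concrete, here is a minimal single-machine sketch in Python. The section names and prices are made up for illustration, and it ignores the harder part of the problem (deciding which SD card the data lives on); it only shows the work being split across a pool of worker processes:

from multiprocessing import Pool

# Made-up ticket data: (section, price) pairs. In the real analysis this
# would be read from wherever the ticket records are stored.
TICKETS = [
    ("Bleachers", 22.50), ("Bleachers", 18.00), ("Bleachers", 25.00),
    ("Field Club", 95.00), ("Field Club", 110.00),
    ("View Box", 31.00), ("View Box", 27.50),
]

def average_for_section(section):
    # Each worker takes one section from the queue and averages its prices.
    prices = [price for s, price in TICKETS if s == section]
    return section, sum(prices) / len(prices)

if __name__ == "__main__":
    sections = sorted({s for s, _ in TICKETS})
    with Pool(processes=4) as pool:  # roughly one worker per core
        for section, avg in pool.map(average_for_section, sections):
            print(section, round(avg, 2))

Pool.map hands the sections out to the workers and keeps them busy, which is exactly the queueing and load balancing I would otherwise have to manage by hand.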

Fortunately, there is an amazing piece of software that automates much of this shuffling, organizing of data, and setting up of the queues. It's called Hadoop. In the next few weeks, I'll be writing about what Hadoop does, and how I installed it on my Raspberry Pis to get my cluster to work.
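
As a small preview, one common way to express a calculation like the section average in Hadoop is its "streaming" interface, where the map and reduce steps are just little scripts that read lines from standard input and write lines to standard output. The sketch below is only meant to give a flavor of that pattern; it assumes, hypothetically, that each ticket record is a comma-separated line whose first two fields are the section and the price.

The mapper turns each raw record into a key-value pair of section and price:

#!/usr/bin/env python
# mapper.py: emit "section<TAB>price" for every ticket record
import sys

for line in sys.stdin:
    if not line.strip():
        continue
    fields = line.rstrip("\n").split(",")
    section, price = fields[0], fields[1]  # assumed (hypothetical) column order
    print("%s\t%s" % (section, price))

The reducer receives those pairs sorted by section, so it can average each group of identical keys as they stream past:

#!/usr/bin/env python
# reducer.py: input arrives sorted by section, so average each run of keys
import sys

current_section, total, count = None, 0.0, 0
for line in sys.stdin:
    if not line.strip():
        continue
    section, price = line.rstrip("\n").split("\t")
    if section != current_section:
        if current_section is not None:
            print("%s\t%.2f" % (current_section, total / count))
        current_section, total, count = section, 0.0, 0
    total += float(price)
    count += 1
if current_section is not None:
    print("%s\t%.2f" % (current_section, total / count))

Hadoop handles everything in between: splitting the input across the Pis, running the scripts where the data lives, and sorting the mapper output before it reaches the reducer. More on that in the next posts.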

*I could "trick" the computer into thinking it has more threads, and there might be reasons to do that. But I ignore that for the moment.
