Sunday, July 24, 2016

Hadoop... There It Is (Part 2)

Well, at long last, I have completed my Hadoop Raspberry Pi cluster. It took a couple of months to dive back into this project. I now have my own personal cloud, running technology similar to what powers some of the world's most important tech companies. However, my cloud is pretty lame. It is less powerful than the MacBook Air that I am currently writing this post on. But at least it's complete, and it's time to write about it!

What did I learn?
I dove into the project, looking for an opportunity to gain an appreciation for the relationship between hardware and the computing resources necessary for Data Science. However, the learning experience was a little disappointing.

Building the cluster didn't give me a deeper appreciation for the way this piece of technology works. First, I followed many step-by-step instructions which pretty much just told me what to do. I should give these instructions a lot of credit for helping me do this; they can be found here (Jonas Widriksson) and here (Carsten Mönning).* Second, Hadoop, for all the magic of being "the cloud", is a piece of software like anything else. There are some configs to deal with, but it's mostly off the shelf. I guess this is a compliment to Hadoop: it's not all that hard to set up and is relatively easy to get running. This allows you to focus on the functionality, which is exactly how software should be engineered, but at the same time it allowed me to be less cognizant of the relationship between software and hardware.

However, I learned a couple of important things. I learned some handy tips about working with the Raspberry Pis. For example, spend the little bit of extra money to get SD cards with the operating system pre-installed. Also, it may be a little easier to plug the Pi into a monitor and keyboard directly rather than ssh-ing in to do everything. It's also a good idea to have a computer with enough disk space that you can clone the SD cards on it (this was actually the major delay). Finally, make sure all your hardware works; I was held up by a faulty router.

Though I am glad to have my own cluster now, I learned that I don't particularly enjoy the installation and building aspect of computers. I frequently felt frustrated getting over multiple little hurdles, like realizing my router was faulty, or hunting for the folder where an executable lived because I was reading the wrong instructions. This frustration outweighed the pleasure of completing the project. I'm glad there are some people who enjoy this type of work (in fact, I would love to work with them), but it's not for me. I'll stick to data and analyses from now on.

What can I do with it?
In my last post on this topic, I described the hardware and how it relates to computations I might be working on, but I can expand on it here. Hadoop provides two services that are incredibly valuable for big data.

A number of computers networked together is generally called a cluster. Hadoop is the operating system for the cluster. In Windows, OS X, or Linux, a user would navigate around to find files in different folders. If a user had to do that for every separate computer on the cluster, programming would be pretty complicated. So the Hadoop Distributed File System (HDFS) sets up a virtual file system, where the user can store their data in different directories, but those directories are not necessarily permanently tied to a single machine. Hadoop figures out how to do that for the user.
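To give a flavor of what that looks like in practice, here is a minimal sketch of copying a file into HDFS and listing it from one of the nodes. I'm just wrapping the standard hdfs dfs commands in Python; the /user/pi/books directory and the file name are hypothetical examples from my setup, not anything baked into Hadoop.

    # Minimal sketch: put a local text file into HDFS and list the directory.
    # Assumes the hdfs client is on the PATH of the node this runs on; the
    # directory and file names below are hypothetical examples.
    import subprocess

    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/user/pi/books"])
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", "shakespeare.txt", "/user/pi/books/"])
    subprocess.check_call(["hdfs", "dfs", "-ls", "/user/pi/books"])

Under the hood, HDFS splits the file into blocks and spreads them over the Pis, but from the command line it still looks like one ordinary directory tree.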

Recognizing that computers sometimes crash (and with many computers, one will almost certainly crash), HDFS stores multiple copies of the data across different machines. This minimizes the chances that the cluster actually loses data, or that the user is unable to access it when they need it. Previously, I did some cloud computing where I literally had to spin up the equivalent of a thousand small separate computers (there were reasons why it made sense for that project, trust me), and the lack of a centralized file system was a real pain!

The second thing Hadoop offers is MapReduce. MapReduce is a computational framework that works very well for distributed computing, or doing calculations across many computers. Fortunately, MapReduce isn't very creatively named; the name describes pretty much exactly what it does. MapReduce has three steps, but the user really only has to worry about two of them: Map and Reduce. In the Map step, the user takes an observation of data and gives Hadoop a key and a value. The user can also do any transformation to the data, with the caveat that it can only look at one observation at a time. Hadoop then shuffles (the hidden second step) the data, so every piece with the same key ends up on the same computer. Then, Hadoop "reduces" the data: it streams through the data with like keys and applies some operation to them (like adding them together).
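The classic example is counting words. Here is a minimal sketch in the Hadoop Streaming style, where the mapper and reducer are just small scripts that read lines from stdin and write tab-separated key/value pairs to stdout. The file names mapper.py and reducer.py are my own hypothetical choices.

    # mapper.py -- the Map step: for each word in each input line, emit the
    # key/value pair (word, 1) as a tab-separated line. Hadoop shuffles on the key.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word.lower() + "\t1")

    # reducer.py -- the Reduce step: after the shuffle, input arrives sorted by
    # key, so all the 1s for the same word are adjacent and can be summed in a
    # single streaming pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Neither script knows or cares how many Pis the data is spread across; that is exactly the point of the framework.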

It turns out that many complex data processing and machine learning algorithms can be written as a sequence of map and reduce steps. Programs like Hive and Pig essentially do MapReduce under the hood, but let you call commands that are abstracted away from the Map and Reduce steps. So this programming paradigm is great for computation on clusters, and Hadoop lets you do it.

So that's pretty much it. By being a file system for multiple computers and facilitating calculations on the data, Hadoop has become the de facto standard for big data. It works on essentially any machine, from small Raspberry Pis to "commodity-grade" servers. It's not that this type of distributed computing is new, but Hadoop made it cheap and accessible.

What will I do with it?
So now that I have this cluster up and running, there is a world of things I can do with it. So far, I have used this "awesome" technology to take a document of words and count how frequently each word appears. Data Scientists: we count words!
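For the record, kicking off that word count as a Hadoop Streaming job looked roughly like the sketch below. The jar location and version number depend on how and where Hadoop was installed, and the HDFS paths are just the hypothetical ones from earlier.

    # Sketch of launching the word count with Hadoop Streaming. The jar path,
    # version number, and HDFS paths are illustrative, not universal.
    import subprocess

    subprocess.check_call([
        "hadoop", "jar",
        "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar",
        "-files", "mapper.py,reducer.py",
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
        "-input", "/user/pi/books/shakespeare.txt",
        "-output", "/user/pi/wordcount-out",
    ])

The output lands back in HDFS as a set of part files, one per reducer.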

Now that it's up and running, I need to find some projects that take advantage of it. Of course, with it being smaller than my laptop, there isn't a ton I can do. When I think about it, the biggest thing this buys me is a computer that is always on, because otherwise I only own laptops. So I have a couple of data projects, using Twitter and those little Amazon buttons, that will be easier with a computer that catches data whenever it comes in. But I will also find ways to take advantage of the parallel computing on this machine!

*I found the Widriksson instructions a little better, but they were using an old version of Hadoop, so the Mönning instructions were a little more up to date.

