9:00 am: Good morning!
9:30 am: Intro to Hadoop and MapReduce
10:30 am: Set up Hadoop on your own DigitalOcean droplets:
You will be creating temporary new droplets for this exercise, not using your McNulty machines.
Create a new droplet, $10 per month, 1 GB RAM. Hadoop will require at least that much memory, so we cannot get away with a 512 MB machine.
What we are setting up is one namenode (master), one secondary namenode (which periodically checkpoints the master's filesystem metadata; despite the name, it is not a hot standby) and one datanode (slave). Normally, there is a single namenode and a single secondary namenode, but many, many datanodes. In this simple setup, we will have a single datanode.
When you set these up, you run the corresponding server program on each machine (for example, the namenode server runs on its own computer).
In a cool twist, we can set this up in an even simpler way. We can run all three servers (namenode, secondary namenode and datanode) on the same machine, and set the IPs of all these "machines" to localhost. The system works exactly the same way: the start-up scripts still ssh into each "machine" to launch its server, and the servers still talk to each other over the network, but it's all on the same machine.
This is the simplest Hadoop cluster setup. It is still a full cluster, but we don't need to fire up more than one cloud computer. The interface (how we use it) is exactly the same as if we had a separate computer for each of these three nodes, or if we had a thousand datanodes.
Once you fire up your 1 GB droplet, follow this tutorial to create this setup.
11:30 am: Simple Map and Reduce on the command line
12:00 pm: Lunch
1:30 pm: Hadoop tutorial
5:00 pm: Stop Work
By now you will have done word counts on the 3 Gutenberg texts. Now, instead of word counts, compute the tf-idf for all words in these texts.
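As a starting point, here is a minimal sketch of tf-idf written in a map/reduce style in plain Python. It is not Hadoop code and not the tutorial's solution. It assumes one common variant of the formula, tf = (count of word in document) / (total words in document) and idf = ln(number of documents / number of documents containing the word); other variants exist. The toy documents, file names and function names are all hypothetical.

```python
import math
from collections import Counter

# Toy stand-ins for the three texts (hypothetical; not the Gutenberg files).
docs = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog sat on the log",
    "c.txt": "the cat chased the dog",
}

def map_term_counts(doc_id, text):
    """Map step: emit ((word, doc_id), 1) for every word in the document."""
    for word in text.lower().split():
        yield (word, doc_id), 1

def reduce_sum(pairs):
    """Reduce step: sum the values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def tf_idf(docs):
    """Return {(word, doc_id): score} using tf = n / doc_len, idf = ln(N / df)."""
    # Pass 1: word counts per (word, document), as in the word-count exercise.
    counts = reduce_sum(
        pair for doc_id, text in docs.items()
        for pair in map_term_counts(doc_id, text)
    )
    # Pass 2: total words per document and number of documents per word.
    doc_len = Counter()
    doc_freq = Counter()
    for (word, doc_id), n in counts.items():
        doc_len[doc_id] += n
        doc_freq[word] += 1
    # Pass 3: combine into the final score.
    n_docs = len(docs)
    return {
        (word, doc_id): (n / doc_len[doc_id]) * math.log(n_docs / doc_freq[word])
        for (word, doc_id), n in counts.items()
    }

if __name__ == "__main__":
    scores = tf_idf(docs)
    for (word, doc_id), score in sorted(scores.items()):
        print(f"{word}\t{doc_id}\t{score:.4f}")
```

A word that appears in every document (here, "the") scores zero under this variant, while a word unique to one document scores highest. On a real cluster each pass would likely become its own map/reduce job, with the output of one feeding the next.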
Hadoop: The Definitive Guide (3rd Ed.)
Good book by Apache Hadoop contributor Tom White
What is Apache?
What is Nutch?
TED Talk (Peter Diamandis 90% of data 2 years)
practical guide "Hadoop in my IT department: How to plan a cluster?"
Great article comparing DFSs (GFS, HDFS, Amazon Dynamo, Microsoft Azure)
big data article 1: the hard sell. "Addressing 5 objections to big data"
big data article 2: the singularity is coming, a brief take by a creepy big data optimist