In this post I will explain how to run the Hadoop Streaming utility on a Hadoop Cluster in Local (Standalone) Mode. Hadoop Streaming uses executables or scripts to create a MapReduce job and submits the job to a Hadoop cluster. In an earlier post I explained how to download and install Apache Hadoop in Local (Standalone) Mode. For this post I will use that Apache Hadoop Cluster to run Hadoop Streaming.
I recommend that you read my introduction to Hadoop Streaming (BigData Investigation 5 – MapReduce with Python and Hadoop Streaming) before you continue with this post. In the Hadoop Streaming post I analyzed sample data using Hadoop Streaming and two Python scripts on the Cloudera Quickstart VM. For your convenience I am copying the syntax of the hadoop command here; see the Hadoop Streaming post for details.
[cloudera@quickstart hadoop-book]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input file:/home/cloudera/hadoop-book/input/ncdc/sample.txt \
-output file:/tmp/output_storageulf \
-mapper /home/cloudera/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_map.py \
-reducer /home/cloudera/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_reduce.py
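The two scripts come from the hadoop-book repository and implement the classic maximum-temperature example. In case you do not have the repository at hand, here is a minimal sketch of the idea in Python (my own simplified version, not a verbatim copy of the book's scripts): the mapper extracts the year and the air temperature from each fixed-width NCDC record, and the reducer emits the maximum temperature per year from the sorted mapper output.

#!/usr/bin/env python
# Sketch of max_temperature_map.py: read NCDC records from stdin,
# emit "year<TAB>temperature" for valid readings.
import re
import sys

for line in sys.stdin:
    record = line.strip()
    year, temp, quality = record[15:19], record[87:92], record[92:93]
    # +9999 marks a missing reading; [01459] are the good quality codes.
    if temp != "+9999" and re.match("[01459]", quality):
        print("%s\t%s" % (year, int(temp)))

#!/usr/bin/env python
# Sketch of max_temperature_reduce.py: stdin arrives sorted by year,
# so keep a running maximum and emit it whenever the year changes.
import sys

last_year, max_temp = None, None
for line in sys.stdin:
    year, temp = line.strip().split("\t")
    temp = int(temp)
    if year == last_year:
        max_temp = max(max_temp, temp)
    else:
        if last_year is not None:
            print("%s\t%s" % (last_year, max_temp))
        last_year, max_temp = year, temp
if last_year is not None:
    print("%s\t%s" % (last_year, max_temp))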
We need to adjust the syntax of this command to run it on our Apache Hadoop Cluster in Local (Standalone) Mode. First we need to find the location of the hadoop-streaming jar file.
[storageulf@hadoop ~]$ locate hadoop-streaming | grep jar
/home/storageulf/hadoop/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar
/home/storageulf/hadoop/hadoop-2.7.2/share/hadoop/tools/sources/hadoop-streaming-2.7.2-sources.jar
/home/storageulf/hadoop/hadoop-2.7.2/share/hadoop/tools/sources/hadoop-streaming-2.7.2-test-sources.jar
[storageulf@hadoop ~]$
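If locate is not available on your system, or its database is stale, a find command should do the job just as well:

[storageulf@hadoop ~]$ find ~/hadoop -name "hadoop-streaming*.jar"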
Next we need to adjust the paths for the input data and the output directory. This is required because the Cloudera Quickstart VM is configured in Pseudo-Distributed Mode, where paths refer to HDFS by default (hence the explicit file: scheme in the command above), whilst the Apache Hadoop Cluster which I am using for this post is configured in Standalone (Local) Mode, where all paths refer to the local filesystem. That's all. For the curious reader I have made the complete output available at GitHub.
[storageulf@hadoop ~]$ which hadoop
~/hadoop/hadoop-2.7.2/bin/hadoop
[storageulf@hadoop ~]$ hadoop jar /home/storageulf/hadoop/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-input ~/hadoop-book/input/ncdc/sample.txt \
-output ~/output \
-mapper ~/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_map.py \
-reducer ~/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_reduce.py
...
16/09/26 03:46:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
...
16/09/26 03:46:08 INFO streaming.StreamJob: Output directory: /home/storageulf/output
[storageulf@hadoop ~]$
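Because Hadoop Streaming does nothing more than connect the mapper and the reducer via stdin and stdout, you can also sanity-check the two scripts in Standalone Mode without Hadoop at all. A plain shell pipeline (a sketch, assuming the scripts are executable) should produce the same two result lines that we will find in part-00000 below:

cat ~/hadoop-book/input/ncdc/sample.txt \
  | ~/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_map.py \
  | sort \
  | ~/hadoop-book/ch02-mr-intro/src/main/python/max_temperature_reduce.py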
The output is stored in the file "part-00000" in the directory "output". It is exactly the same as the output of the command we used in the Hadoop Streaming post.
[storageulf@hadoop ~]$ ls -l output/
total 4
-rw-r--r--. 1 storageulf storageulf 17 Sep 26 03:46 part-00000
-rw-r--r--. 1 storageulf storageulf  0 Sep 26 03:46 _SUCCESS
[storageulf@hadoop ~]$ cat output/part-00000
1949	111
1950	22
[storageulf@hadoop ~]$
Please note that the output directory and all its files are owned by storageulf:storageulf. This differs from the output of Hadoop Streaming on the Cloudera QuickStart VM, where the output directory and all files were owned by yarn:yarn.
Hadoop in Local (Standalone) Mode runs the Hadoop application and all Hadoop services in a single Java Virtual Machine (JVM), which runs with the uid and gid of the user who issues the hadoop command. There are no separate "java" processes for Hadoop components such as HDFS and YARN. This explains why in this post the output directory and its files are owned by storageulf:storageulf.
[storageulf@hadoop ~]$ ps aux | grep java
storage+ 10154  0.0  0.0 112648   976 pts/0    R+   03:48   0:00 grep --color=auto java
[storageulf@hadoop ~]$
Ulf’s Conclusion
Running the Hadoop Streaming example application on our home-made Hadoop Cluster in Local (Standalone) Mode delivers the same results as on the Cloudera QuickStart VM. This was actually expected, but it is good to know that our first home-made cluster works as designed. By configuring Apache Hadoop on my own, I now understand the internals of Hadoop a little bit better.
In the next post I will explain how to install Apache Hadoop in Pseudo-Distributed Mode to get a test system where all Hadoop services run as separate processes.
Changes:
2016/10/15 added link – “how to install Apache Hadoop in Pseudo-Distributed Mode” => BigData Investigation 9 – Installing Apache Hadoop in Pseudo-Distributed Mode