In this post I will explain how to configure Apache Hadoop in Pseudo-Distributed Mode. In an earlier post I explained how to install Apache Hadoop in Local (Standalone) Mode. Now I will apply the configuration changes required to turn that cluster into Pseudo-Distributed Mode.
Step 1 – Install Apache Hadoop in Local (Standalone) Mode: As a starting point we need an Apache Hadoop cluster in Local (Standalone) Mode. In an earlier post I described how to create such a cluster on CentOS 7 Linux running on VirtualBox (BigData Investigation 7 – Installing Apache Hadoop in Local (Standalone) Mode). Before executing the next steps, I cloned my Linux server in VirtualBox, just in case I need a Hadoop cluster in Local (Standalone) Mode again in the future.
Step 2 – Configure password-less ssh: My Hadoop cluster runs Hadoop 2.7.2. The documentation for configuring Pseudo-Distributed Mode is here. I start with the configuration of password-less ssh to localhost. Ssh is required by the helper scripts which start and stop the Hadoop services and by some other utilities. The Hadoop framework itself does not require ssh.
Log in as the user who installed the Hadoop code – on my system ‘storageulf’ – and then ssh to localhost. After adding the fingerprint of my server to the list of known hosts, the ssh command asks for the password. Abort the ssh command with ‘CTRL-C’.
login as: storageulf
storageulf@127.0.0.1's password:
Last login: Mon Sep 26 03:39:56 2016 from gateway
[storageulf@hadoop ~]$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is 01:4f:be:db:f7:60:2a:d5:ee:67:0b:aa:2e:60:f9:e7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
storageulf@localhost's password:
[storageulf@hadoop ~]$
So we need to create an ssh key pair with the Linux ssh-keygen command and add the public key to the authorized_keys file. After that we can ssh to localhost without being prompted for a password. Finally, log out from the remote shell to return to the login shell.
[storageulf@hadoop ~]$ ssh-keygen -t rsa -b 2048 -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/storageulf/.ssh/id_rsa.
Your public key has been saved in /home/storageulf/.ssh/id_rsa.pub.
The key fingerprint is:
ec:f0:9a:52:78:fc:f4:04:fd:e6:7c:8a:ed:af:9c:b6 storageulf@hadoop.storageulf
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|        .        |
|       .. .      |
|     o. S. .     |
|    . ++. . o    |
|     o ooo +     |
|    . o. .++..   |
|     .o .oE*.    |
+-----------------+
[storageulf@hadoop ~]$ ls ~/.ssh/
id_rsa  id_rsa.pub  known_hosts
[storageulf@hadoop ~]$ touch .ssh/authorized_keys
[storageulf@hadoop ~]$ chmod 0600 ~/.ssh/authorized_keys
[storageulf@hadoop ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[storageulf@hadoop ~]$ ssh localhost
Last login: Tue Sep 27 11:58:10 2016 from gateway
[storageulf@hadoop ~]$ logout
logout
Connection to localhost closed.
[storageulf@hadoop ~]$
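As an aside, the append to authorized_keys and the chmod can also be done in one step with OpenSSH's ssh-copy-id helper. This is just a sketch of the alternative, not what I ran above; it assumes ssh-copy-id is installed on the CentOS 7 box:

ssh-keygen -t rsa -b 2048 -P '' -f ~/.ssh/id_rsa       # same key pair as above
ssh-copy-id -i ~/.ssh/id_rsa.pub storageulf@localhost  # appends the public key and fixes permissions (asks for the password once)
ssh localhost hostname                                 # should now return without a password prompt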
Step 3 – Configure Hadoop environment variables: Next we need to configure JAVA_HOME in hadoop-env.sh. The scripts which start the Hadoop services fail if JAVA_HOME is not configured properly.
[storageulf@hadoop ~]$ set | grep JAVA_HOME
JAVA_HOME=/etc/alternatives/jre_openjdk
[storageulf@hadoop ~]$ which hadoop
~/hadoop/hadoop-2.7.2/bin/hadoop
[storageulf@hadoop ~]$ cd ~/hadoop/hadoop-2.7.2/etc/hadoop/
[storageulf@hadoop hadoop]$ cp hadoop-env.sh hadoop-env.orig
[storageulf@hadoop hadoop]$ vi hadoop-env.sh
[storageulf@hadoop hadoop]$ diff hadoop-env.sh hadoop-env.orig
25,26c25
< ###export JAVA_HOME=${JAVA_HOME}
< JAVA_HOME=/etc/alternatives/jre_openjdk
---
> export JAVA_HOME=${JAVA_HOME}
[storageulf@hadoop hadoop]$
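To double-check that the Hadoop scripts really pick up the JVM, a quick sanity check looks like this (a sketch; the jre_openjdk path is specific to my CentOS 7 installation):

grep 'JAVA_HOME=' ~/hadoop/hadoop-2.7.2/etc/hadoop/hadoop-env.sh   # shows the value we just set
/etc/alternatives/jre_openjdk/bin/java -version                    # the JVM the Hadoop daemons will use
hadoop version                                                     # fails early if JAVA_HOME is broken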
Step 4 – Configure the Hadoop Distributed File System (HDFS): To configure HDFS, we need to override two default settings. First we need to set the default filesystem to ‘hdfs://localhost’ in ‘core-site.xml’. The documentation also specifies a port (‘hdfs://localhost:9000’), but I decided not to specify one. The web is full of examples; some specify port 9000, others specify no port, and I have not found a single argument for using port 9000. Therefore I decided to go with the default port, which is 8020/tcp.
Second, HDFS by default creates three replicas of each file. For our single-node cluster we need to set the number of replicas to ‘1’ in hdfs-site.xml. On a new system both files contain only comments, so we simply overwrite them.
[storageulf@hadoop hadoop]$ pwd
/home/storageulf/hadoop/hadoop-2.7.2/etc/hadoop
[storageulf@hadoop hadoop]$ cat > core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost</value>
    </property>
</configuration>
[storageulf@hadoop hadoop]$ cat > hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
[storageulf@hadoop hadoop]$
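The effective values can be verified with the hdfs getconf utility, which reads the configuration files without requiring any service to be running:

hdfs getconf -confKey fs.defaultFS      # expected: hdfs://localhost
hdfs getconf -confKey dfs.replication   # expected: 1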
Step 5 – Format the filesystem: Now we are ready to format the filesystem. The complete output of the hdfs format command is available on GitHub.
[storageulf@hadoop hadoop]$ cd
[storageulf@hadoop ~]$ which hdfs
~/hadoop/hadoop-2.7.2/bin/hdfs
[storageulf@hadoop ~]$ hdfs namenode -format
16/09/27 14:46:01 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop.storageulf/10.0.2.15
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.2
...
STARTUP_MSG:   java = 1.8.0_101
************************************************************/
16/09/27 14:46:01 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
16/09/27 14:46:01 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-d3cb987d-000a-4a81-8c1c-28aaf4be867b
...
16/09/27 14:46:02 INFO blockmanagement.BlockManager: defaultReplication = 1
...
16/09/27 14:46:02 INFO namenode.FSNamesystem: fsOwner             = storageulf (auth:SIMPLE)
16/09/27 14:46:02 INFO namenode.FSNamesystem: supergroup          = supergroup
16/09/27 14:46:02 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/09/27 14:46:02 INFO namenode.FSNamesystem: HA Enabled: false
16/09/27 14:46:02 INFO namenode.FSNamesystem: Append Enabled: true
16/09/27 14:46:02 INFO namenode.FSDirectory: ACLs enabled? false
16/09/27 14:46:02 INFO namenode.FSDirectory: XAttrs enabled? true
16/09/27 14:46:02 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
...
16/09/27 14:46:02 INFO common.Storage: Storage directory /tmp/hadoop-storageulf/dfs/name has been successfully formatted.
16/09/27 14:46:02 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/09/27 14:46:02 INFO util.ExitUtil: Exiting with status 0
16/09/27 14:46:02 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop.storageulf/10.0.2.15
************************************************************/
[storageulf@hadoop ~]$
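Note that the storage directory ended up under /tmp/hadoop-storageulf because I did not set hadoop.tmp.dir, and /tmp may be cleaned up by the operating system. For a throw-away test VM that is fine; as a sketch, a persistent location could be configured in core-site.xml like this (the path /home/storageulf/hadoop/data is just an example, and the filesystem would have to be re-formatted afterwards):

cat > core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost</value>
    </property>
    <property>
        <!-- example path, not used in this post -->
        <name>hadoop.tmp.dir</name>
        <value>/home/storageulf/hadoop/data</value>
    </property>
</configuration>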
Step 6 – Configure MapReduce and YARN: To run MapReduce on YARN in Pseudo-Distributed Mode we need to set a few additional parameters. Again the respective configuration files contain only comments, so we overwrite them entirely.
[storageulf@hadoop ~]$ cd hadoop/hadoop-2.7.2/etc/hadoop/
[storageulf@hadoop hadoop]$ cat > mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
[storageulf@hadoop hadoop]$ cat > yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
[storageulf@hadoop hadoop]$ cd ~
[storageulf@hadoop ~]$
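Since a typo in these hand-written XML files only shows up when the services start, it does not hurt to check that they are at least well-formed (assuming the libxml2 package providing xmllint is installed):

cd ~/hadoop/hadoop-2.7.2/etc/hadoop
xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml && echo "all config files are well-formed XML"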
Step 7 – Start the services: Finally we are ready to start the services. We start HDFS (start-dfs.sh), YARN (start-yarn.sh) and the MapReduce Job History server (mr-jobhistory-daemon.sh).
[storageulf@hadoop ~]$ which start-dfs.sh
~/hadoop/hadoop-2.7.2/sbin/start-dfs.sh
[storageulf@hadoop ~]$ start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/hadoop-storageulf-namenode-hadoop.storageulf.out
localhost: starting datanode, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/hadoop-storageulf-datanode-hadoop.storageulf.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 01:4f:be:db:f7:60:2a:d5:ee:67:0b:aa:2e:60:f9:e7.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/hadoop-storageulf-secondarynamenode-hadoop.storageulf.out
[storageulf@hadoop ~]$ which start-yarn.sh
~/hadoop/hadoop-2.7.2/sbin/start-yarn.sh
[storageulf@hadoop ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/yarn-storageulf-resourcemanager-hadoop.storageulf.out
localhost: starting nodemanager, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/yarn-storageulf-nodemanager-hadoop.storageulf.out
[storageulf@hadoop ~]$ which mr-jobhistory-daemon.sh
~/hadoop/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh
[storageulf@hadoop ~]$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/storageulf/hadoop/hadoop-2.7.2/logs/mapred-storageulf-historyserver-hadoop.storageulf.out
[storageulf@hadoop ~]$
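For reference, the services can be stopped again with the matching counterparts in the sbin directory; I keep the cluster running here, so this is just noted for completeness:

mr-jobhistory-daemon.sh stop historyserver
stop-yarn.sh
stop-dfs.sh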
Step 8 – Check results: Now our Pseudo-Distributed cluster is completely configured. Let’s check which processes are running. There are six Hadoop processes running: namenode, datanode and secondarynamenode for HDFS, resourcemanager and nodemanager for YARN, and historyserver for the MapReduce Job History Server.
[storageulf@hadoop ~]$ ps x
  PID TTY      STAT   TIME COMMAND
14937 ?        S      0:00 sshd: storageulf@pts/0
14938 pts/0    Ss     0:00 -bash
16257 ?        Sl     0:05 /etc/alternatives/jre_openjdk/bin/java -Dproc_namenode ...
16378 ?        Sl     0:04 /etc/alternatives/jre_openjdk/bin/java -Dproc_datanode ...
16533 ?        Sl     0:03 /etc/alternatives/jre_openjdk/bin/java -Dproc_secondarynamenode ...
16685 pts/0    Sl     0:09 /etc/alternatives/jre_openjdk/bin/java -Dproc_resourcemanager ...
16799 ?        Sl     0:07 /etc/alternatives/jre_openjdk/bin/java -Dproc_nodemanager ...
17123 pts/0    Sl     0:05 /etc/alternatives/jre_openjdk/bin/java -Dproc_historyserver ...
17315 pts/0    R+     0:00 ps x
[storageulf@hadoop ~]$
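A shorter alternative to ps is the jps tool which ships with the JDK (it requires a full JDK, not only a JRE); it lists the running JVMs by their main class:

jps   # should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, JobHistoryServer (plus Jps itself)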
The six services listen on plenty of TCP ports and establish a few sessions among themselves. I will study the usage of these ports in a later post.
[storageulf@hadoop ~]$ lsof -i -P
COMMAND   PID       USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    16257 storageulf  183u  IPv4  71367      0t0  TCP *:50070 (LISTEN)
java    16257 storageulf  200u  IPv4  71381      0t0  TCP localhost:8020 (LISTEN)
java    16257 storageulf  210u  IPv4  72422      0t0  TCP localhost:8020->localhost:34896 (ESTABLISHED)
java    16378 storageulf  183u  IPv4  72179      0t0  TCP *:50010 (LISTEN)
java    16378 storageulf  187u  IPv4  72190      0t0  TCP localhost:41897 (LISTEN)
java    16378 storageulf  210u  IPv4  72406      0t0  TCP *:50075 (LISTEN)
java    16378 storageulf  213u  IPv4  72413      0t0  TCP *:50020 (LISTEN)
java    16378 storageulf  226u  IPv4  72418      0t0  TCP localhost:34896->localhost:8020 (ESTABLISHED)
java    16533 storageulf  194u  IPv4  73412      0t0  TCP *:50090 (LISTEN)
java    16685 storageulf  193u  IPv6  74413      0t0  TCP *:8031 (LISTEN)
java    16685 storageulf  204u  IPv6  74434      0t0  TCP *:8030 (LISTEN)
java    16685 storageulf  214u  IPv6  74639      0t0  TCP *:8032 (LISTEN)
java    16685 storageulf  224u  IPv6  76657      0t0  TCP *:8088 (LISTEN)
java    16685 storageulf  231u  IPv6  78101      0t0  TCP *:8033 (LISTEN)
java    16685 storageulf  241u  IPv6  78121      0t0  TCP hadoop.storageulf:8031->hadoop.storageulf:44274 (ESTABLISHED)
java    16799 storageulf  200u  IPv6  78083      0t0  TCP *:39078 (LISTEN)
java    16799 storageulf  211u  IPv6  78092      0t0  TCP *:8040 (LISTEN)
java    16799 storageulf  221u  IPv6  78096      0t0  TCP *:13562 (LISTEN)
java    16799 storageulf  222u  IPv6  78111      0t0  TCP *:8042 (LISTEN)
java    16799 storageulf  231u  IPv6  78113      0t0  TCP hadoop.storageulf:44274->hadoop.storageulf:8031 (ESTABLISHED)
java    17123 storageulf  200u  IPv4  78420      0t0  TCP *:10033 (LISTEN)
java    17123 storageulf  211u  IPv4  78444      0t0  TCP *:19888 (LISTEN)
java    17123 storageulf  218u  IPv4  78449      0t0  TCP *:10020 (LISTEN)
[storageulf@hadoop ~]$
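Three of the listening ports belong to the built-in web UIs of Hadoop 2.x, which can already be checked from the VM itself; the port numbers match the lsof output above:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50070/   # HDFS NameNode UI
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8088/    # YARN ResourceManager UI
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:19888/   # MapReduce Job History UI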
Ulf’s Conclusion
Apache Hadoop installs in Local (Standalone) Mode by default. Turning a freshly installed Apache Hadoop cluster into Pseudo-Distributed Mode is straightforward: it only requires adjusting a few configuration files and starting the services. Each Hadoop service (e.g. HDFS NameNode, HDFS DataNode, HDFS Secondary NameNode, YARN ResourceManager, YARN NodeManager, MapReduce Job History Server) runs in its own Java Virtual Machine (JVM) in a separate Linux process.
In the next post I will run the example Hadoop Streaming application from my earlier posts on my new Pseudo-Distributed Mode Apache Hadoop cluster, to better understand the differences between Local (Standalone) Mode and Pseudo-Distributed Mode.
Changes:
2016/10/28 added link – “on my new Pseudo-Distributed Mode Apache Hadoop Cluster” => BigData Investigation 10 – Using Hadoop Streaming on Hadoop Cluster in Pseudo-Distributed Mode