Saturday, June 11, 2011

mini-cluster - Additional Datanodes (CDH3 pseudo dist on Mac 10.6.x)


Intention: Describe the configuration steps to run additional data nodes on a pseudo-distributed Hadoop CDH3 0.20.x cluster.
Environment: Hadoop CDH3 0.20.x, Mac OS X 10.6.x

There are multiple posts on the Hadoop mailing list on this topic; however, nothing worked for me because they were written for Hadoop 0.21. If you are running Hadoop CDH3 0.20.x on your Mac, this configuration outline might be helpful and save you some valuable time.


Step #1: In your HADOOP_HOME directory, copy the "conf" directory to, say, "conf2"
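
A minimal sketch of that copy, assuming HADOOP_HOME points at your CDH3 install directory:

   cd $HADOOP_HOME
   cp -R conf conf2     # conf2 will hold the settings for datanode#2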


Step #2: In the conf2 directory, edit as follows:

  a) In hadoop-env.sh, provide a unique, non-default HADOOP_IDENT_STRING, e.g. ${USER}_02
  b) In hdfs-site.xml, change dfs.data.dir to point to the desired target directories/volumes for datanode#2, and of course make sure those directories exist.  Also remove these targets from the dfs.data.dir list for datanode#1 in conf/hdfs-site.xml.
  c) In hdfs-site.xml, set the following four "address:port" strings to something that does not conflict with the other datanode or with other processes running on this box (a combined sketch of these conf2 edits follows this list):
    - dfs.datanode.address  (default 0.0.0.0:50010)
    - dfs.datanode.ipc.address  (default 0.0.0.0:50020)
    - dfs.datanode.http.address  (default 0.0.0.0:50075)
    - dfs.datanode.https.address  (default 0.0.0.0:50475)
Note: the defaults above are what datanode#1 is probably running on.  I added 2 to each port number for datanode#2 and it seemed to work okay.  You might also wish to note the default ports associated with the namenode and job/task tracker processes, in case they are running on the same box:
    - fs.default.name  0.0.0.0:9000
    - dfs.http.address  0.0.0.0:50070
    - dfs.https.address  0.0.0.0:50470
    - dfs.secondary.http.address  0.0.0.0:50090
    - mapred.job.tracker.http.address  0.0.0.0:50030
    - mapred.task.tracker.report.address  127.0.0.1:0
    - mapred.task.tracker.http.address  0.0.0.0:50060
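
Putting Step #2 together, here is a minimal sketch of the conf2 edits. The _02 suffix, the /Users/Kumar/hdfs/dn2 path, and the +2 port offsets are illustrative choices of mine, not requirements:

   # conf2/hadoop-env.sh -- unique ident string so pid/log file names don't collide
   export HADOOP_IDENT_STRING=${USER}_02

   # create the new data directory before starting datanode#2 (path is illustrative)
   mkdir -p /Users/Kumar/hdfs/dn2/data

   <!-- conf2/hdfs-site.xml -- datanode#2 storage and non-conflicting ports -->
   <property>
     <name>dfs.data.dir</name>
     <value>/Users/Kumar/hdfs/dn2/data</value>
   </property>
   <property>
     <name>dfs.datanode.address</name>
     <value>0.0.0.0:50012</value>
   </property>
   <property>
     <name>dfs.datanode.ipc.address</name>
     <value>0.0.0.0:50022</value>
   </property>
   <property>
     <name>dfs.datanode.http.address</name>
     <value>0.0.0.0:50077</value>
   </property>
   <property>
     <name>dfs.datanode.https.address</name>
     <value>0.0.0.0:50477</value>
   </property>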

Step #3: Starting DataNode #2


(This is the step that differs between 0.20 and 0.21: remove the stale pid file and use hadoop-daemon.sh as below to start the additional data node.)


   # remove the stale datanode pid file so hadoop-daemon.sh will start a second instance
   rm ./pids/hadoop-Kumar-datanode.pid

   # start datanode#2 using the conf2 configuration directory
   ./bin/hadoop-daemon.sh --config ./conf2 start datanode
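
If the second datanode starts cleanly, you can confirm it with the standard JDK jps tool and the datanode web UI; the port below assumes the +2 offset from Step #2:

   jps                              # should now list two DataNode processes
   curl http://localhost:50077/     # datanode#2 web UI, if you chose 50075 + 2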



Mailing List References:
  • For the Hadoop 0.20.x configuration, refer to the mailing list entry below:

    Link to Hadoop mailing list

  • For the Hadoop 0.21.x configuration, refer to the mailing list entry below:

    Link to Hadoop mailing list (check Matthew Foley's message)