Saturday, June 11, 2011

mini-cluster - Additional Datanodes (CDH3 pseudo dist on Mac 10.6.x)


Intention: Describe the configuration steps to run additional datanodes on a pseudo-distributed Hadoop CDH3 0.20.x cluster.
Environment: Hadoop CDH3 0.20.x, Mac OS X 10.6.x

There are multiple posts on the Hadoop mailing list on this topic; however, nothing seemed to work for me because they were written for Hadoop 0.21. If you are running Hadoop CDH3 0.20.x on your Mac, this configuration outline might be helpful and save you some valuable time.


Step #1: In your HADOOP_HOME directory, copy the "conf" directory to, say, "conf2"
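For example (assuming HADOOP_HOME points at your Hadoop install directory):

   cd $HADOOP_HOME
   cp -R conf conf2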


Step #2: In the conf2 directory, edit as follows:

  a) In hadoop-env.sh, provide a unique, non-default HADOOP_IDENT_STRING, e.g. ${USER}_02
  b) In hdfs-site.xml, point dfs.data.dir at the desired target directories/volumes for datanode#2, and make sure those directories exist. Also remove these targets from the dfs.data.dir list for datanode#1 in conf/hdfs-site.xml.
  c) In hdfs-site.xml, set the following four "address:port" values to something that does not conflict with the other datanode or any other processes running on this box:
    - dfs.datanode.address  (default 0.0.0.0:50010)
    - dfs.datanode.ipc.address  (default 0.0.0.0:50020)
    - dfs.datanode.http.address  (default 0.0.0.0:50075)
    - dfs.datanode.https.address  (default 0.0.0.0:50475)
Note: the defaults above are what datanode#1 is probably running on. I added 2 to each port number for datanode#2 and it worked fine; a sample conf2/hdfs-site.xml with these overrides follows this list. You might also want to note the default ports for the namenode and job/task tracker processes, in case they are running on the same box:
    - fs.default.name  0.0.0.0:9000
    - dfs.http.address  0.0.0.0:50070
    - dfs.https.address  0.0.0.0:50470
    - dfs.secondary.http.address  0.0.0.0:50090
    - mapred.job.tracker.http.address  0.0.0.0:50030
    - mapred.task.tracker.report.address  127.0.0.1:0
    - mapred.task.tracker.http.address  0.0.0.0:50060
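
The overrides in conf2/hdfs-site.xml might end up looking like the sketch below; the data directory and the "+2" ports are only examples, so adjust them to whatever is free on your machine:

   <configuration>
     <property>
       <!-- a separate data directory, not listed in conf/hdfs-site.xml (example path) -->
       <name>dfs.data.dir</name>
       <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/data2</value>
     </property>
     <property>
       <name>dfs.datanode.address</name>
       <value>0.0.0.0:50012</value>
     </property>
     <property>
       <name>dfs.datanode.ipc.address</name>
       <value>0.0.0.0:50022</value>
     </property>
     <property>
       <name>dfs.datanode.http.address</name>
       <value>0.0.0.0:50077</value>
     </property>
     <property>
       <name>dfs.datanode.https.address</name>
       <value>0.0.0.0:50477</value>
     </property>
   </configuration>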

Step #3: Start datanode #2


(This is the step that differs between 0.20 and 0.21: remove the stale pid file, then use hadoop-daemon.sh as shown below to start the additional datanode.)


   rm ./pids/hadoop-Kumar-datanode.pid

  ./bin/hadoop-daemon.sh --config ./conf2 start datanode   # start datanode #2 using ./conf2
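
To verify that both datanodes are running (the 50077 port below assumes the "+2" scheme above; use whatever you set for dfs.datanode.http.address in conf2):

   jps | grep DataNode        # expect two DataNode entries
   curl -s http://localhost:50077/ >/dev/null && echo "datanode#2 web UI is up"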



Mailing list references:
  • For Hadoop 0.20.x configuration, see: Link to Hadoop mailing list
  • For Hadoop 0.21.x configuration, see: Link to Hadoop mailing list (check Matthew Foley's message)

Thursday, May 19, 2011

Installing Cloudera Hadoop (hadoop-0.20.2-cdh3u0) on Mac 10.6.x

Installing Cloudera Hadoop CDH3 on Mac 10.6 was pretty straightforward; however, I did encounter a couple of barriers and spent time researching them. This post outlines the installation steps and possible solutions to the known issues.

  • Step #1: Download the Hadoop from Cloudera website (hadoop-0.20.2-cdh3u0.tar.gz)
  • Step #2: After unzipping the tar file into your working directory, make the following changes to the configuration files:

    • core-site.xml (under the conf directory)

      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:8020</value>
        </property>
        <property>
           <name>hadoop.tmp.dir</name>
           <value>/var/lib/hadoop-0.20/cache/Kumar</value>
        </property>
      </configuration>

    • hdfs-site.xml (under the conf directory)

      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>

        <property>
          <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
          <name>dfs.name.dir</name>
          <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name</value>
        </property>

        <property>
          <name>dfs.data.dir</name>
          <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/data</value>
        </property>
      </configuration>


      Make sure read/write permissions are enabled on these directories.
    • mapred-site.xml (under the conf directory)

      <configuration>
        <property>
          <name>mapred.job.tracker</name>
          <value>localhost:8021</value>
        </property>
      </configuration>


  • Step #3: Format the namenode:

    ./bin/hadoop namenode -format
  • Step #4: Start your Hadoop cluster:

    ./bin/start-all.sh

  • Step #5: Test HDFS by copying a file into it:

    //creates a directory named census-original-files  under /user/$username/

     ./bin/hadoop fs -mkdir  census-original-files

    //list your home directory and you should see the directory created above

     ./bin/hadoop fs -ls

    //copy a file from local to hdfs

    ./bin/hadoop fs -copyFromLocal README.txt /user/Kumar/census-original-files/

    //check the content of the file in hdfs

    ./bin/hadoop fs -cat /user/Kumar/census-original-files/README.txt

    //check whether all the Hadoop java processes have started:

    jps

    expected java processes (the leading numbers are PIDs and will differ):
    6722 SecondaryNameNode
    6571 NameNode
    6779 JobTracker
    6647 DataNode
    6855 TaskTracker



    If any of the above java processes is missing, check the logs directory. (Usually the configuration files are not set correctly or directory write permissions are missing.)


  • Step #6: To shut down the cluster:

    ./bin/stop-all.sh



Issues encountered:

  Issue #1: If you encounter the error below while copying content into HDFS, a possible solution is to physically delete all files under dfs.name.dir and dfs.data.dir and then rerun the format command (Step #3); a command sketch follows the stack trace.
  If the above steps do not solve the issue, validate the configurations under the conf directory. (Note: the Cloudera distribution bundles example conf files under the example-confs directory.)

could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1469)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:649)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
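
A minimal command sketch of that cleanup, assuming the dfs.name.dir and dfs.data.dir values from the hdfs-site.xml above (adjust the paths if yours differ; note that this wipes all HDFS data):

    ./bin/stop-all.sh
    rm -rf /var/lib/hadoop-0.20/cache/hadoop/dfs/name/*
    rm -rf /var/lib/hadoop-0.20/cache/hadoop/dfs/data/*
    ./bin/hadoop namenode -format
    ./bin/start-all.sh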


Issue #2: JobTracker or TaskTracker is not loading; the mapred.job.tracker setting in mapred-site.xml is probably missing (see Step #2 above).

FATAL org.apache.hadoop.mapred.JobTracker: java.lang.RuntimeException: Not a host:port pair: local
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:140)
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:124)
    at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2427)
    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2050)
    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2043)
    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:294)
    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:286)
    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:4767)

Issue #3: Grant read/write permission to all directories configured for DFS and the MapReduce local storage; otherwise the TaskTracker fails with the error below.

ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because org.apache.hadoop.util.DiskChecker$DiskErrorException: all local directories are not writable
    at org.apache.hadoop.mapred.TaskTracker.checkLocalDirs(TaskTracker.java:3495)
    at org.apache.hadoop.mapred.TaskTracker.initializeDirectories(TaskTracker.java:659)
    at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:734)
    at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1431)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3521)
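
A minimal sketch of the permission fix, assuming the hadoop.tmp.dir location from core-site.xml above and that the daemons run as your own user (adjust the path and owner as needed):

    sudo mkdir -p /var/lib/hadoop-0.20/cache/Kumar
    sudo chown -R $USER /var/lib/hadoop-0.20/cache
    chmod -R u+rwX /var/lib/hadoop-0.20/cache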



Hope these steps save you time.