Thursday, May 19, 2011

Installing Cloudera Hadoop (hadoop-0.20.2-cdh3u0) on Mac 10.6.x

Installing Cloudera Hadoop CDH3 on Mac OS X 10.6 was pretty straightforward. However, I did run into a couple of barriers and spent some time researching them, so this post outlines the installation steps and possible solutions to the issues I hit.

  • Step #1: Download Hadoop from the Cloudera website (hadoop-0.20.2-cdh3u0.tar.gz).
  • Step #2: Unpack the tar file into your working directory, then make the following changes to the configuration files:

    • core-site.xml file (under conf directory)

      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:8020</value>
        </property>
        <property>
           <name>hadoop.tmp.dir</name>
           <value>/var/lib/hadoop-0.20/cache/Kumar</value>
        </property>
      </configuration>

    • hdfs-site.xml file (under conf directory)

      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>

        <property>
          <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
          <name>dfs.name.dir</name>
          <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name</value>
        </property>

        <property>
          <name>dfs.data.dir</name>
          <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/data</value>
        </property>
      </configuration>


      Make sure the user running Hadoop has read and write permissions on the dfs.name.dir and dfs.data.dir directories.
    • mapred-site.xml file (under conf directory)

      <configuration>
        <property>
          <name>mapred.job.tracker</name>
          <value>localhost:8021</value>
        </property>
      </configuration>
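
The directories named in the configuration files above have to exist and be writable before the namenode is formatted. Here is a minimal sketch of how that could be done. The helper name prepare_hadoop_dirs is my own (it is not part of Hadoop), and it assumes hadoop.tmp.dir is the cache directory plus the user name, as in my core-site.xml.

```shell
#!/bin/sh
# Create the HDFS name/data directories and the per-user hadoop.tmp.dir,
# then make them readable and writable for the user that runs Hadoop.
# prepare_hadoop_dirs is a hypothetical helper name, not a Hadoop command.
prepare_hadoop_dirs() {
    base="${1:-/var/lib/hadoop-0.20/cache}"   # parent directory used in the configs
    owner="${2:-$(id -un)}"                   # user that will start the daemons

    for dir in "$base/hadoop/dfs/name" "$base/hadoop/dfs/data" "$base/$owner"; do
        mkdir -p "$dir" || return 1
    done

    # Hand the whole tree to that user and allow owner read/write/traverse.
    chown -R "$owner" "$base" || return 1
    chmod -R u+rwX "$base" || return 1
}
```

Calling prepare_hadoop_dirs with no arguments uses the paths from this post and the current user. Since /var/lib is normally owned by root, the function will most likely need to run with sudo there, passing your own user name as the second argument.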


  • Step #3: Format the namenode:

    ./bin/hadoop namenode -format

  • Step #4: Start your Hadoop cluster:

    ./bin/start-all.sh

  • Step #5: Test HDFS by copying a file into it:

    //creates a directory named census-original-files under /user/$username/

    ./bin/hadoop fs -mkdir census-original-files

    //list the directory and you should see the above directory

    ./bin/hadoop fs -ls

    //copy a file from local to hdfs

    ./bin/hadoop fs -copyFromLocal README.txt /user/Kumar/census-original-files/

    //check the content of the file in hdfs

    ./bin/hadoop fs -cat /user/Kumar/census-original-files/README.txt

    //check whether all the Hadoop daemons (Java processes) have started:

    jps

    expected Java processes (the leading numbers are process IDs and will differ):
    6722 SecondaryNameNode
    6571 NameNode
    6779 JobTracker
    6647 DataNode
    6855 TaskTracker



    If any of the above Java processes is missing, check the logs directory. (Mostly the configuration files are not set right or directory write permissions are missing.)
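
The jps check above can also be scripted. This is only a sketch: missing_daemons is a hypothetical helper name of my own, and it assumes the jps output has the "PID Name" layout shown above.

```shell
#!/bin/sh
# Print the name of every expected Hadoop daemon that is absent from
# jps-style output read on stdin; return non-zero if any is missing.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    jps_output=$(cat)
    status=0
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare against the whole second field so that "NameNode"
        # is not satisfied by a "SecondaryNameNode" line.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
            status=1
        fi
    done
    return $status
}
```

Usage would be along the lines of: jps | missing_daemons. No output means all five daemons are up; any name printed is a daemon whose log file is worth a look.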


  • Step #6: Shut down the cluster:

    ./bin/stop-all.sh



Issues encountered:

  Issue #1: If copying content into HDFS fails with the error shown below, a possible solution is to stop the cluster, physically delete all the files under dfs.name.dir and dfs.data.dir, and then re-run the format command (Step #3) before starting the cluster again.
  If that does not solve the issue, validate the configuration files under the conf directory. (Note - the Cloudera distribution bundles example configuration files under the /example-confs directory.)

could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1469)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:649)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
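
For the cleanup described in Issue #1, something like the following sketch could be used. clear_dfs_dirs is a hypothetical helper name of my own, and it wipes everything stored in HDFS, so treat it as an illustration for a throwaway single-node setup only.

```shell
#!/bin/sh
# Remove everything inside the given HDFS storage directories while keeping
# the directories themselves. WARNING: this destroys all data stored in HDFS.
# clear_dfs_dirs is a hypothetical helper name.
clear_dfs_dirs() {
    for dir in "$@"; do
        [ -d "$dir" ] || { echo "skipping $dir: not a directory" >&2; continue; }
        # -mindepth 1 leaves the top-level directory (and its permissions) in place.
        find "$dir" -mindepth 1 -delete
    done
}
```

A typical sequence would be to run ./bin/stop-all.sh, call clear_dfs_dirs with the dfs.name.dir and dfs.data.dir paths from hdfs-site.xml, and then repeat Step #3 and Step #4.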


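  Issue #2: If the JobTracker or TaskTracker does not start, the mapred.job.tracker setting is possibly missing from mapred-site.xml. Without it the value falls back to the default "local", which is not a host:port pair, as the error below shows.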

FATAL org.apache.hadoop.mapred.JobTracker: java.lang.RuntimeException: Not a host:port pair: local
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:140)
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:124)
    at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2427)
    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2050)
    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2043)
    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:294)
    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:286)
    at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:4767)
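
To confirm what a configuration file actually sets, a small lookup like the one below may help. It is a sketch, not a real XML parser: get_hadoop_prop is a hypothetical helper name, and it assumes the layout used in this post, with <name> and <value> on separate lines and the value following the name.

```shell
#!/bin/sh
# Print the <value> of a named property from a Hadoop *-site.xml file.
# Assumes <name> and <value> are each on their own line, <value> after <name>.
# get_hadoop_prop is a hypothetical helper name.
get_hadoop_prop() {
    file="$1"
    prop="$2"
    awk -v prop="$prop" '
        # Remember that we have just seen the wanted property name.
        index($0, "<name>" prop "</name>") { want = 1; next }
        # The next line carrying a <value> belongs to that property.
        want && /<value>/ {
            sub(/^.*<value>/, "")
            sub(/<\/value>.*$/, "")
            print
            exit
        }
    ' "$file"
}
```

For example, get_hadoop_prop conf/mapred-site.xml mapred.job.tracker should print localhost:8021 with the configuration from Step #2. Empty output means the property is not set in that file.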

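  Issue #3: If the TaskTracker fails with the error below, give the user running Hadoop read/write permission on all the directories configured for Hadoop (dfs.name.dir, dfs.data.dir, and hadoop.tmp.dir).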

ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because org.apache.hadoop.util.DiskChecker$DiskErrorException: all local directories are not writable
    at org.apache.hadoop.mapred.TaskTracker.checkLocalDirs(TaskTracker.java:3495)
    at org.apache.hadoop.mapred.TaskTracker.initializeDirectories(TaskTracker.java:659)
    at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:734)
    at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1431)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3521)
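
A quick way to spot the directory behind Issue #3 is to test each configured path for writability. The sketch below does that; report_unwritable is a hypothetical helper name of my own, and the check reflects the current user, so run it as the user that starts Hadoop.

```shell
#!/bin/sh
# Report every given directory that is missing or not writable by the
# current user; return non-zero if any problem is found.
# report_unwritable is a hypothetical helper name.
report_unwritable() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            status=1
        elif [ ! -w "$dir" ]; then
            echo "not writable: $dir"
            status=1
        fi
    done
    return $status
}
```

With the paths from this post, the call would look like: report_unwritable /var/lib/hadoop-0.20/cache/Kumar /var/lib/hadoop-0.20/cache/hadoop/dfs/name /var/lib/hadoop-0.20/cache/hadoop/dfs/data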


Hope these steps save you time.