Wednesday, 26 September 2012

Configuring Hadoop on a CentOS cluster


This blog gives you a straightforward, step-by-step way to configure Hadoop on your CentOS cluster.
My cluster consists of 6 nodes (1 master and 5 slaves).

The master node runs three daemons: JobTracker, NameNode and SecondaryNameNode.
Each slave node runs a TaskTracker and a DataNode.

Steps are:

    -create a separate hadoop user on all nodes (run as root):
       useradd hadoop
       passwd *****

       Log in as the hadoop user and follow the remaining steps on all the nodes.

    -change the hostname if you want.
        in /etc/sysconfig/network set HOSTNAME=HadoopMaster (or HadoopSlave1..5)
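
        For example, /etc/sysconfig/network on the master would typically contain:
            NETWORKING=yes
            HOSTNAME=HadoopMaster
        and the new name can be applied without a reboot (as root):
            hostname HadoopMaster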

    -configure the /etc/hosts file on all nodes.
         172.29.100.191    hadoopmaster.company.local    HadoopMaster
         172.29.100.126    hadoopslave1.company.local    HadoopSlave1
         172.29.100.106    hadoopslave2.company.local    HadoopSlave2
         172.29.100.178    hadoopslave3.company.local    HadoopSlave3
         172.29.100.199    hadoopslave4.company.local    HadoopSlave4
         172.29.100.140    hadoopslave5.company.local    HadoopSlave5
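
        A quick check that the entries work (from any node):
            getent hosts HadoopSlave1 (should print 172.29.100.126)
            ping -c 1 HadoopMaster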

    -configuring passwordless ssh access from the master to all nodes
        ssh-keygen -t dsa (at the master, as the hadoop user; accept the defaults)
        ssh-copy-id -i /home/hadoop/.ssh/id_dsa.pub hadoop@HadoopSlave1 (copies the master's public key to a slave)
        ssh-copy-id -i /home/hadoop/.ssh/id_dsa.pub hadoop@HadoopSlave2 (repeat for all the slave nodes, up to HadoopSlave5)
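
        start-all.sh also connects to the master itself over ssh (it starts the secondary namenode on the host listed in conf/masters), so copy the key to the master too, and check that logins work without a password:
            ssh-copy-id -i /home/hadoop/.ssh/id_dsa.pub hadoop@HadoopMaster
            ssh HadoopSlave1 hostname (should print HadoopSlave1 without asking for a password)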

    -installing hadoop (on all the nodes)
        cd /home/hadoop
        wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
        tar -xvzf hadoop-0.20.2.tar.gz (extract tar)
        mv hadoop-0.20.2 hadoop
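
        Optionally, put hadoop on the hadoop user's PATH so commands can be run without the bin/ prefix (a convenience only; append to /home/hadoop/.bashrc):
            export HADOOP_HOME=/home/hadoop/hadoop
            export PATH=$PATH:$HADOOP_HOME/bin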

    -files to be modified in the hadoop/conf directory (on all the nodes; a sketch for pushing them from the master follows this list)
        conf/core-site.xml
            <property>
              <name>fs.default.name</name>
              <value>hdfs://HadoopMaster:9000/</value>
            </property>

        conf/hdfs-site.xml
            <property>
              <name>dfs.name.dir</name>
              <value>/home/hadoop/hdfs/name</value>
            </property>
            <property>
              <name>dfs.data.dir</name>
              <value>/home/hadoop/hdfs/data</value>
            </property>
            <property>
              <name>dfs.replication</name>
              <value>3</value>
            </property>

        conf/mapred-site.xml
            <property>
              <name>mapred.job.tracker</name>
              <value>HadoopMaster:9001</value>
            </property>

        conf/hadoop-env.sh
            export JAVA_HOME=/usr/java/jdk1.6.0_18
            export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

        conf/masters
            HadoopMaster

        conf/slaves
            HadoopSlave1
            HadoopSlave2
            HadoopSlave3
            HadoopSlave4
            HadoopSlave5
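
        The same configuration must exist on every node. One approach (a sketch using the hostnames above) is to edit the files once on the master, push them to the slaves, and pre-create the local directories named in hdfs-site.xml:
            mkdir -p /home/hadoop/hdfs/name (on the master only)
            for node in HadoopSlave1 HadoopSlave2 HadoopSlave3 HadoopSlave4 HadoopSlave5; do
                scp /home/hadoop/hadoop/conf/* $node:/home/hadoop/hadoop/conf/
                ssh $node "mkdir -p /home/hadoop/hdfs/data"
            done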

Starting hadoop:

Format the namenode (a one-time initialization of HDFS):
    -bin/hadoop namenode -format (run only on the master node; reformatting erases any existing HDFS data)

    -Note: if an error occurs, check the Hadoop logs (the log location is configured in hadoop-env.sh) and debug from there.
   
    -bin/start-all.sh (only on master node)
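
    You can confirm the daemons are up with jps (ships with the JDK):
        jps (on the master: should list NameNode, SecondaryNameNode and JobTracker)
        jps (on each slave: should list DataNode and TaskTracker)
    The namenode web UI is served at http://HadoopMaster:50070 and the jobtracker UI at http://HadoopMaster:50030.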




After Hadoop, if you like, you can go through a simple HBase configuration too.

This is an HBase-managed ZooKeeper configuration (the default).

The changes made to the files are:
1) /home/hadoop/hbase-0.92.1/conf/hbase-env.sh
        ++ export HBASE_HOME=/home/hadoop/hbase-0.92.1
        ++ export HBASE_PID_DIR=/home/hadoop/var/hbase/pids (storing pids of hbase daemons)
        ++ export JAVA_HOME=/usr/java/jdk1.7.0_05/
        #export HBASE_MANAGES_ZK=false (left commented out because it defaults to true, which is what we want here)
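
        Since HBASE_PID_DIR and the ZooKeeper dataDir below point to non-default locations, create them first (the ZooKeeper directory is only used on the quorum host, HadoopMaster here, but creating it everywhere does no harm):
            mkdir -p /home/hadoop/var/hbase/pids /home/hadoop/var/zookeeper
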
2) /home/hadoop/hbase-0.92.1/conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://HadoopMaster:9000/hbase</value>
        <description>The directory shared by RegionServers.</description>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
        <description>The mode the cluster will be in. Possible values are
            false: standalone and pseudo-distributed setups with managed Zookeeper
            true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)</description>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2222</value>
        <description>Property from ZooKeeper's config zoo.cfg.
            The port at which the clients will connect.
        </description>
    </property>
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>hadoopmaster.company.local</value>
      <description>Comma separated list of servers in the ZooKeeper Quorum.
      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
      By default this is set to localhost for local and pseudo-distributed modes
      of operation. For a fully-distributed setup, this should be set to a full
      list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
      this is the list of servers which we will start/stop ZooKeeper on.
      </description>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hadoop/var/zookeeper</value>
      <description>Property from ZooKeeper's config zoo.cfg.
      The directory where the snapshot is stored.
      </description>
    </property>
</configuration>

3) /home/hadoop/hbase-0.92.1/conf/regionservers
    Add the names of the region servers:
        HadoopSlave1
        HadoopSlave2
        HadoopSlave3
        HadoopSlave4
        HadoopSlave5
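
    As with Hadoop, the HBase configuration must match on all nodes; a sketch for pushing it from the master:
        for node in HadoopSlave1 HadoopSlave2 HadoopSlave3 HadoopSlave4 HadoopSlave5; do
            scp /home/hadoop/hbase-0.92.1/conf/hbase-site.xml /home/hadoop/hbase-0.92.1/conf/hbase-env.sh $node:/home/hadoop/hbase-0.92.1/conf/
        done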

Note: From HBase version 0.90 onwards, SASL authentication is available for communication (it is optional), but I have skipped that functionality here.

Starting hbase:
    MASTER--
    ./start-hbase.sh (from /home/hadoop/hbase-0.92.1/bin; also starts the managed ZooKeeper and the regionservers listed in conf/regionservers)
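
    To confirm it came up:
        jps (on the master: should list HMaster and HQuorumPeer, the managed ZooKeeper process)
        jps (on a slave: should list HRegionServer)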

Stopping hbase:
    MASTER--
    ./stop-hbase.sh
    REGIONSERVERS--
    ./hbase-daemon.sh stop regionserver
    ./hbase-daemon.sh stop zookeeper
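
Note: stop HBase before stopping Hadoop, since HBase stores its data in HDFS. Hadoop itself is stopped from the master with:
    bin/stop-all.sh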