Thursday, June 19, 2014

How to set up Cloudera Manager 5.x and Hadoop on Ubuntu 14.04 LTS servers




When you are asked to set up a tool, you will usually choose the latest stable version. With this in mind, we chose Ubuntu 14.04 LTS servers for our infrastructure when we recently set up our new cluster.

Setting up Hadoop and monitoring it turns out to be cumbersome for new users. Cloudera does a great service by making it simpler.

Since Cloudera had no official support for Ubuntu 14.04, we hacked a few steps and got it working.

This post describes:
 * configuring hosts and name resolution
 * setting up Cloudera Manager on Ubuntu 14.04, and the hacks needed to make it work
 * configuring a MySQL database for the Cloudera daemons' bookkeeping
 * deploying a 2-node Hadoop cluster with Cloudera Manager

Before you begin:
 * emacs is used for editing the configuration files.
    To install emacs :
       sudo apt-get update && sudo apt-get install emacs24-nox
    => use vi or vim if you're not familiar with emacs
 * ssh is used throughout to log in to the hosts; login credentials are assumed.

1. Set Host names :
Configuring hostnames properly is essential.
Let's name the two hosts `node2.datoin.com` and `node3.datoin.com`.

ON ALL HOSTS :
  sudo emacs /etc/hostname  # set the name appropriately and save it
  sudo hostname `cat /etc/hostname`  # update hostname on the fly

  Example : node2.datoin.com  # on node2
            node3.datoin.com  # on node3

2. Network : assign static address
Configure a static IP on every host so that the addresses stay the same over time.

ON ALL HOSTS :
  sudo emacs /etc/network/interfaces

  Example :
# em2 is the interface name; it can be eth0, eth1, ...
auto em2
iface em2 inet static
# the address must be different for each host
address 192.168.1.132
network 192.168.1.0
gateway 192.168.1.1
netmask 255.255.255.0
broadcast 192.168.1.255
# Google DNS
dns-nameservers 8.8.8.8 8.8.4.4
# only if you want to enable wake-on-LAN
up ethtool -s em2 wol g

  Note : comments in this file must be on a line of their own; end-of-line comments are not supported.

3. Configure name resolution between the hosts
We are about to form a cluster, so the hosts need to be introduced to each other. Since we are not using a local name server, let's update the hosts entries on each node.

ON ALL HOSTS :
 sudo emacs /etc/hosts  # add an entry for each host

Example >>
 192.168.1.131  node2.datoin.com node2
 192.168.1.132  node3.datoin.com node3

Note :
  * there must be one `127.0.0.1 localhost` entry at the top
  * Tip : do not map 127.0.x.x to the hostname; use the assigned static IP instead.
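
The entries above follow a fixed pattern (IP, fully qualified name, short name), so they can be generated instead of typed. A minimal sketch; `hosts_entry` is a hypothetical helper name used only for illustration:

```shell
# hosts_entry IP FQDN : print one /etc/hosts line in the format used above.
# (hosts_entry is a hypothetical helper, not part of any tool.)
hosts_entry() {
  ip="$1"
  fqdn="$2"
  short="${fqdn%%.*}"   # keep only the part before the first dot
  printf '%s  %s %s\n' "$ip" "$fqdn" "$short"
}

# Example (commented out): print the two entries used in this guide
# hosts_entry 192.168.1.131 node2.datoin.com
# hosts_entry 192.168.1.132 node3.datoin.com
```

After editing /etc/hosts, `getent hosts node3.datoin.com` on each host should print the static IP, not a 127.0.x.x address.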


4. Decide Roles for machines based on your cluster setup

Mapping of roles to hosts used in this guide:
   node2 : name-node, job-tracker/resource-manager, data-node, task-tracker, oozie, hue
   node3 : cloudera-manager, data-node, task-tracker, cloudera-agents, secondary-name-node, zookeeper, mysql

   First we install and configure Cloudera Manager; the rest is a matter of mouse clicks in the manager!

5. Setup MySQL :
  For a production cluster, we use a MySQL database to keep records.
  If you decide to use the embedded PostgreSQL for experimentation, skip this step. If you plan to use the cluster for serious work with a large cluster size, MySQL may be a good option.

  5.1 : Install MySQL server on one of the nodes
     sudo apt-get install mysql-server   # in this example, on node3
     # Provide a root user password and make sure to remember it!

  5.2 : Create databases and give access
     ON MYSQL SERVER HOST :
       Launch the MySQL console
        mysql -u root -p

    # Cloudera manager db user, database and grant
    create user 'cmf'@'%' identified by 'xyz';
    create database cmf;
    grant all privileges on cmf.* to 'cmf'@'%' identified by 'xyz';

    # For activity monitor
    create user 'amon'@'%' identified by 'xyz';
    create database amon;
    grant all privileges on amon.* to 'amon'@'%' identified by 'xyz';

    # Hive Meta store
    create user 'hive'@'%' identified by 'xyz';
    create database metastore;
    grant all privileges on metastore.* to 'hive'@'%' identified by 'xyz';

    # Flush all changes
    FLUSH PRIVILEGES;

    Exit from the mysql console.
    Write down the user names, databases, and passwords somewhere.
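
The three user/database/grant groups in step 5.2 share one pattern, so they can be generated by a small function instead of being typed by hand. This is only a sketch: `make_db_sql` is a hypothetical name, and it assumes the password contains no single quotes.

```shell
# make_db_sql DB USER PASSWORD : print the three statements for one service.
# (make_db_sql is a hypothetical helper; passwords containing a single
# quote are NOT handled.)
make_db_sql() {
  db="$1"; user="$2"; pass="$3"
  printf "create user '%s'@'%%' identified by '%s';\n" "$user" "$pass"
  printf "create database %s;\n" "$db"
  printf "grant all privileges on %s.* to '%s'@'%%' identified by '%s';\n" "$db" "$user" "$pass"
}

# Example (commented out): feed all statements to the MySQL console
# { make_db_sql cmf cmf xyz
#   make_db_sql amon amon xyz
#   make_db_sql metastore hive xyz
#   echo 'FLUSH PRIVILEGES;'
# } | mysql -u root -p
```

Review the generated statements on the terminal first (run the function without the pipe) before sending them to the server.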

  5.3 Tweak Mysql Server settings

      # backup settings
      sudo cp /etc/mysql/my.cnf /etc/mysql/my.cnf.bak

      # stop MySQL before editing any config
      sudo service mysql stop
      # remember to start it afterwards with 'sudo service mysql start'

     5.3.1 Make the MySQL server accessible from other hosts :
           set the bind address to '0.0.0.0' (all interfaces) instead of '127.0.0.1' (loopback)
           sudo emacs /etc/mysql/my.cnf
               >>  find the line 'bind-address = 127.0.0.1'
                      and change it to 'bind-address = 0.0.0.0'
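
If you prefer not to open an editor, the same change to /etc/mysql/my.cnf can be made with sed. A minimal sketch; `set_bind_address` is a hypothetical helper name, and it assumes the file already contains an uncommented `bind-address` line:

```shell
# set_bind_address FILE ADDRESS : rewrite the value of the bind-address line.
# sed keeps a copy of the previous file as FILE.orig.
# (set_bind_address is a hypothetical helper; it changes nothing if the file
# has no uncommented bind-address line.)
set_bind_address() {
  cnf="$1"
  addr="$2"
  sed -i.orig "s/^\([[:space:]]*bind-address[[:space:]]*=[[:space:]]*\).*/\1$addr/" "$cnf"
}

# Example (commented out, needs root):
# set_bind_address /etc/mysql/my.cnf 0.0.0.0
```

Check the result with `grep bind-address /etc/mysql/my.cnf` before starting MySQL again.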


     5.3.2  Tune MySQL settings for Cloudera
Configure InnoDB as specified here

[mysqld]
# Under mysqld block
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M

# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit  = 2
innodb_log_buffer_size          = 64M
innodb_buffer_pool_size         = 4G
innodb_thread_concurrency       = 8
innodb_flush_method             = O_DIRECT
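# on MySQL 5.5, if mysqld refuses to start after this size change, move the old
# /var/lib/mysql/ib_logfile0 and ib_logfile1 out of the way and start again
innodb_log_file_size = 512M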

     5.3.3 Restart Mysqld
        sudo service mysql start

   5.4 Install MySQL Java Connector
       This must be installed on every host that needs a
       connection to the MySQL server (such as the hosts running Cloudera Manager or the Hive server).
       Install it on all hosts if you are not sure:
         sudo apt-get install libmysql-java

6 Install Cloudera Manager.
   You need one host to act as the manager for your cluster.
This host does not have to be part of the Hadoop cluster. Read more at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_intro_to_cm_install.html?scroll=cmig_topic_3_3_unique_1

   ON MANAGER HOST :
    6.1. Fake ubuntu 14.04 as 12.04
         Cloudera has no support for 14.04 yet (as of June 2014),
         so we are going to install the 12.04 packages on 14.04, with a little trouble!
         Once Cloudera supports 14.04 by hosting repositories and builds, this step may not be required.

     ON ALL UBUNTU 14.04 HOSTS :
     6.1.1 Take a backup of the release version file
              sudo cp /etc/lsb-release /etc/lsb-release.bak
              # remember to revert it after the setup is complete
     6.1.2 Modify the release file with Ubuntu 12.04 LTS contents
              sudo emacs /etc/lsb-release
              # replace its contents with the following, taken from a 12.04 server
        DISTRIB_ID=Ubuntu
        DISTRIB_RELEASE=12.04
        DISTRIB_CODENAME=precise
        DISTRIB_DESCRIPTION="Ubuntu 12.04.3 LTS"
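
Because steps 6.1.1 and 6.1.2 have to be repeated on every host, they can be wrapped in one function. A minimal sketch, with `fake_lsb_release` as a hypothetical helper name; it keeps the first backup so that a second run does not overwrite the real 14.04 file:

```shell
# fake_lsb_release FILE : back up FILE to FILE.bak (once) and write the
# Ubuntu 12.04 contents shown above.
# (fake_lsb_release is a hypothetical helper name.)
fake_lsb_release() {
  file="$1"
  # keep the first backup; never overwrite it on a later run
  [ -e "$file.bak" ] || cp "$file" "$file.bak"
  cat > "$file" <<'EOF'
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.3 LTS"
EOF
}

# Example (commented out, needs root):
# fake_lsb_release /etc/lsb-release
```

`lsb_release -a` should then report 12.04 until the backup is restored at the end of the setup.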

  6.2 Add cloudera repositories to fetch packages, add key
        sudo curl "http://archive.cloudera.com/cm5/ubuntu/precise/amd64/cm/cloudera.list" -o /etc/apt/sources.list.d/cloudera_precise.list
        curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -

  6.3 Install Cloudera manager
        sudo apt-get update
        sudo apt-get install cloudera-manager-daemons \
           cloudera-manager-server

  6.4 Configure DB settings
         sudo emacs /etc/cloudera-scm-server/db.properties
         # update host, db, user and password with the values created in step 5.2.
         # keep comments on their own lines: in a .properties file, anything after the value becomes part of the value.
         In our example,
com.cloudera.cmf.db.type=mysql
# host name of the MySQL server
com.cloudera.cmf.db.host=localhost
com.cloudera.cmf.db.name=cmf
com.cloudera.cmf.db.user=cmf
# actual password
com.cloudera.cmf.db.password=xyz
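
The same keys can also be set from the shell, one at a time. A minimal sketch, with `set_prop` as a hypothetical helper name; it assumes the value contains no `|`, `&` or backslash characters, since those are special to sed:

```shell
# set_prop FILE KEY VALUE : set KEY=VALUE in a .properties file.
# Replaces the line if the key exists, appends it otherwise.
# (set_prop is a hypothetical helper; values containing |, & or \ are
# NOT handled.)
set_prop() {
  file="$1"; key="$2"; value="$3"
  if grep -q "^$key=" "$file"; then
    sed -i.orig "s|^$key=.*|$key=$value|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example (commented out, needs root):
# f=/etc/cloudera-scm-server/db.properties
# set_prop "$f" com.cloudera.cmf.db.type mysql
# set_prop "$f" com.cloudera.cmf.db.host localhost
```

Each replacement leaves the previous version of the file next to it with an `.orig` suffix.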



 6.5  Start Cloudera manager
  sudo service cloudera-scm-server start

In a browser, open http://<manager_host>:7180/
If you see the login form, you are good.
  If something is wrong, inspect the log with `sudo tailf -100 /var/log/cloudera-scm-server/cloudera-scm-server.log`
   common errors :
            > Driver class not found : the MySQL JDBC connector library is not installed => step 5.4
            > Cannot connect         : the DB settings are invalid => step 6.4


6.6 More Fixes/hacks for Ubuntu 14.04

   6.6.1 update-alternatives moved from /usr/sbin/ to /usr/bin/,
      but some of the scripts use the hard-coded old path.
     sudo ln -s /usr/bin/update-alternatives \
        /usr/sbin/update-alternatives
   6.6.2 Agents require `ntpdc`, which is absent. Install it:
     sudo apt-get install ntp

   6.6.3 Agents require 'fuse-utils', but it is missing from the 14.04 apt repos.
       It has to be downloaded and installed manually in advance. Get it from the Debian repo.

    ON ALL NODES :
        wget http://ftp.cn.debian.org/debian/pool/main/f/fuse/fuse-utils_2.9.0-2+deb7u1_all.deb
        sudo dpkg -i fuse-utils_2.9.0-2+deb7u1_all.deb
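
Running `ln -s` a second time fails because the link already exists, which is annoying when the same commands are replayed on several hosts. A minimal sketch of an idempotent version; `link_if_missing` is a hypothetical helper name:

```shell
# link_if_missing TARGET LINK : create the symlink only when nothing
# exists at LINK yet, so the command can be re-run safely.
# (link_if_missing is a hypothetical helper name.)
link_if_missing() {
  target="$1"; link="$2"
  if [ -e "$link" ] || [ -L "$link" ]; then
    echo "exists: $link"
  else
    ln -s "$target" "$link" && echo "linked: $link -> $target"
  fi
}

# Example (commented out, needs root):
# link_if_missing /usr/bin/update-alternatives /usr/sbin/update-alternatives
```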
       
7. Continue Installation through web interface 

Open http://<manager_host>:7180/ in browser and proceed with installation wizard
   7.1 Specify hosts.
        >> node[2-3]
        or >> 192.168.1.13[1-2]
        
        Select hosts and proceed
        
   7.2 Use Recommended Installation :
        Use Parcels and customize services before installation
        
   7.3 Now the manager needs access to all the hosts
       If the same user account already exists on all hosts (for example, EC2 instances have an 'ubuntu' account by default), skip the user account creation step that follows. Otherwise create a new user and give it to Cloudera Manager. It needs an account with passwordless sudo to install, start and stop services.
       
  7.3.1 Create user on all hosts
  sudo adduser cdhusr

  7.3.2 Give cdhusr passwordless sudo on all hosts

    sudo visudo
         add this line under the entry for root
cdhusr  ALL = (ALL) NOPASSWD: ALL
         save it
To know more, visit http://serverfault.com/questions/160581/how-to-setup-passwordless-sudo-on-linux
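
An alternative to editing the main sudoers file is a drop-in file under /etc/sudoers.d/. The sketch below assumes the stock Ubuntu sudoers file, which includes that directory; `write_sudoers_rule` is a hypothetical helper name.

```shell
# write_sudoers_rule USER FILE : write the passwordless-sudo rule for USER
# into FILE and make it read-only, as sudo expects for sudoers files.
# (write_sudoers_rule is a hypothetical helper; it assumes /etc/sudoers
# includes the /etc/sudoers.d directory.)
write_sudoers_rule() {
  user="$1"; out="$2"
  printf '%s  ALL = (ALL) NOPASSWD: ALL\n' "$user" > "$out"
  chmod 0440 "$out"
}

# Example (commented out, needs root). The file name must not contain a dot.
# write_sudoers_rule cdhusr /etc/sudoers.d/cdhusr
# visudo -c -f /etc/sudoers.d/cdhusr   # syntax check
```

Keep a root shell open until `sudo -l -U cdhusr` shows the new rule, so that a mistake in the file cannot lock you out.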

         
 7.3.3 Authorize cdhusr with an ssh key or a password.
          Let's go with the password option: provide the password and resume the installation.
          
  7.3.4 If you see red at the end of the installation because there is no heartbeat from the Cloudera agents, don't worry, it's a known issue.

  This happens because the CDH5 agents use an older version of python on 14.04.
  Replace the python binary in the agent's Python virtualenv with the newer one from Ubuntu 14.04.

  ON ALL HOSTS :
# go to virtual env
cd /usr/lib/cmf/agent/build/env/
# take backup
sudo mv  bin/python bin/python.bak
# use the newer version from ubuntu 14.04
sudo cp /usr/bin/python2.7 bin/python
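
The commands above can be wrapped so that a second run does not overwrite the backup with the already-replaced binary. A minimal sketch; `swap_agent_python` is a hypothetical helper name:

```shell
# swap_agent_python ENV_DIR NEW_PYTHON : back up ENV_DIR/bin/python to
# python.bak and copy NEW_PYTHON in its place. Does nothing if a backup
# already exists.
# (swap_agent_python is a hypothetical helper name.)
swap_agent_python() {
  env_dir="$1"; new_python="$2"
  if [ -e "$env_dir/bin/python.bak" ]; then
    echo "already swapped: $env_dir"
    return 0
  fi
  mv "$env_dir/bin/python" "$env_dir/bin/python.bak" &&
    cp "$new_python" "$env_dir/bin/python"
}

# Example (commented out, needs root):
# swap_agent_python /usr/lib/cmf/agent/build/env /usr/bin/python2.7
```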
           
      Retry the installation. If you see green, the manager installation is complete.

             
8. Deploy cluster
           Cloudera has a nice and intuitive UI to deploy and manage services. Remember the role assignment plan from step 4, and make yourself familiar with "Add Cluster" and "Add Service". This step is not described here as it is mostly performed in the web UI.


9. Post Installation.
Revert the release version file to 14.04.
sudo mv /etc/lsb-release /etc/lsb-release.precise
sudo mv /etc/lsb-release.bak /etc/lsb-release


If you find anything missing or face any issues, please ask in the comments.