Thursday, April 30, 2015

Guiding your Product Development through Voice of the Customer

“Great companies are built on great products.”    
- Elon Musk

Elon Musk was spot on about the importance of product development. Many people, though, mistake a great product for an aesthetically pleasing one. The beauty of a great product lies not just in its aesthetics but in how elegantly it works. It needs the right vision, a clear purpose and revolutionary changes. As Steve Jobs once said,

"Design is the fundamental soul of a man-made creation that ends up expressing itself in successive outer layers of the product or service. The iMac is not just the color or translucence or the shape of the shell. The essence of the iMac is to be the finest possible consumer computer in which each element plays together."

As important as revolutionary ideas are in making a product a breakthrough invention, it is successive iterative improvement that makes the product more usable and finely tuned, with each element playing together. Even if a product is first to market, it is iterative improvement that helps an organisation surge ahead of the competition. To improve, one needs a feedback loop from the product's users to its designers and developers. This holds both while the product is under development and once it is out in the market. While the quality assurance team and beta users help the product team during development, it is the real users who help develop the product at the later stage. The objective of this article is to discuss the second part of this process.

Brief History

Iterative product development is not a new idea. Almost everything is naturally built and developed over multiple iterations. Each iteration eliminates what is bad and incorporates what may be good. The process is evolutionary in nature. Successive iterations over a period of time can change a product completely from its inception. Compare the first digital computer ever developed to a smartphone of today! Although both are indeed digital computers, their purposes and uses are completely different. These improvements, or rather changes, are made by considering use cases, called functions, and the value derived from them. Thus, I can arguably say that structured product development began during World War II under the name ‘Value Engineering’. Quality Function Deployment is another notable methodology that tries to make use of users' feedback.

Changing Technology Landscape

Every new disruptive technology radically changes how we do business. In the pre-Internet era, feedback was captured through customer surveys. These surveys were collected over a period of time and analyzed, and the feedback was incorporated into a new, improved version of the product. This process used to span months and consumed extensive human resources. The very same survey that was conducted offline can now be done online with far fewer resources. Each advancement in computing technology, and its adoption in business processes, makes the process increasingly manageable. Meanwhile, the widespread adoption of the Internet and new ways of communication are also changing customers' behaviour. Customers' grievances are no longer written down and mailed through the post; they are tweeted or posted on the wall of a Facebook page. The once-quiet customer is becoming more vocal. Their problems are not only sent to the organisation but also broadcast to millions of other customers. This can adversely affect any business if not handled properly. Today, millions of customer support calls are transcribed every day. The information explosion is real. To our relief, along with the information explosion we are also better equipped to handle such volumes of information and harness the power of data.


Like every automation technology, advances in Natural Language Processing and the ability to handle large amounts of data using distributed computing reduce the resources spent on processing customer feedback and increase the effectiveness of such systems. This will eventually shorten the time to market, or the product development iteration time, and help organisations outdo their competitors. We are now able to process customer feedback written in free-flowing text, extract the relevant information, and structure and organise it according to a specification. The extracted information is later converted into functions and prioritised by considering how many people are facing each issue, along with various other ranking parameters. One can even go to the extent of quantifying how emotionally connected customers are to issues and prioritise accordingly. One more important aspect of the customer's voice is finding the "unknown unknowns". It is now easier to mine customer suggestions and incorporate new functions that were never considered by the core development team. This helps any organisation innovate with customer acceptance validation.

Although these technologies help us crunch large amounts of data, extract meaning and organise it, the automation is of no use unless it is put to work in the product development process (PDP). Every organisation has its own processes, even though they are similar in nature. These processes need to be adapted to use the Voice of the Customer effectively.

Let us go through an example of how to use the voice of the customer to improve a product. Suppose you are a smartphone manufacturer and you have a flagship phone that was received well by top critics and most users. Your sales figures are good. But you would like to listen carefully to what users say about the phone on social media and review sites, to identify the topmost problems you might want to fix. To get there, you list all the credible review sites where your phone is reviewed. Then you grab all the reviews, split them into sentences, score each sentence for sentiment, and extract the key phrases, that is, the features of your phone that appear in negative sentences. Once you have the features, rank them by how many people are facing the problem and how important the issue is. The next course of action is to prioritise customer care so that it handles such issues better, and to figure out what can be done to fix them. In essence, ranking is a major tool for optimization.
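The steps in this example can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions: the word lists and the feature list are hypothetical stand-ins for a trained sentiment model and a real key-phrase extractor, and the ranking uses only the number of reviews complaining about a feature, not how important the issue is.

```python
import re
from collections import Counter

# Hypothetical word lists; a real system would use a trained sentiment model.
NEGATIVE_WORDS = {"bad", "poor", "slow", "terrible", "cracked"}
POSITIVE_WORDS = {"good", "great", "fast", "excellent", "love"}
# Hypothetical product features; a real system would extract key phrases.
FEATURES = {"battery", "camera", "screen", "speaker"}


def split_sentences(review):
    """Split a review into sentences on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its words."""
    return re.findall(r"[a-z']+", sentence.lower())


def sentiment_score(sentence):
    """Positive words add one, negative words subtract one."""
    words = tokenize(sentence)
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives


def rank_negative_features(reviews):
    """Rank features by the number of reviews that mention them negatively."""
    counts = Counter()
    for review in reviews:
        flagged = set()  # count each feature at most once per review
        for sentence in split_sentences(review):
            if sentiment_score(sentence) < 0:
                flagged |= FEATURES & set(tokenize(sentence))
        counts.update(flagged)
    return counts.most_common()


reviews = [
    "Great phone. The battery is terrible.",
    "The camera is slow. The battery is poor too.",
    "I love the screen. Excellent camera!",
]
ranking = rank_negative_features(reviews)
print(ranking)
```

With these three made-up reviews, the battery is flagged in two reviews and the camera in one, so the battery would be the first issue to look at.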

What’s Next?

Many of the core technology parts used in processing customer feedback, such as ‘sentiment analysis’, ‘feature extraction’ and ‘ranking’, can be reused in the much larger domains of Customer Experience Management and Brand Management. This feedback can also be quantified to aid the strategic decision-making of any organisation. A complete, elegantly integrated system will certainly help any organisation serve its customers' needs and adapt to the changing landscape.

I will primarily be writing about CEM, Brand Management and Voice of the Customer in this space. We at Datoin are developing those very core technologies, which can assist in building a complete suite of applications that solve the aforementioned problems. See you soon with something new to talk about. Stay tuned!

A data pipeline for the Internet era

Hold your breath, count three, two, one and release. Congratulations! You just spent three seconds of your life. Guess what happened on the Internet in the meantime?
  • How many new tweets got tweeted on Twitter?
  • How many posts got posted on Facebook, Google+?
  • How many new videos got uploaded to YouTube?
  • How many new images got uploaded to Instagram?
  • How many articles got posted in WordPress, Blogger?
  • How many questions got asked on Quora and the StackExchange sites, and how many answers were posted to those questions?

While the above questions may seem rhetorical, trying to find approximate numbers gives a sense of how fast the online world moves. New users keep joining the Empire of the Internet as we move towards connecting every person and thing on this planet to it. We are seeing entrepreneurs who have given us the freedom to express and share our opinions online. A variety of devices are invented from time to time, always ready to consume our content. I don't need to move near a big fat machine, wait minutes for that thing to turn on and connect to the Internet to write and share my opinions; the smartphones and tablets of today are always on, always connected, eagerly waiting, hungry to consume what I generate.

Consequently, the volume of information is proliferating! The other part of the story: how do we deal with such a huge volume of information? As the old proverb says, "Necessity is the mother of invention", and the necessity of dealing with huge volumes of information led to the creation of amazing frameworks. Thanks go to the engineers who made up their minds to decipher the hidden hints and bring solutions. Not every problem is solved, at least not yet, but some good souls have contributed their creations by open sourcing them, so anyone can explore and improve on them. As a result, a lot has changed in the past decade (2005 - 2015). If you had been stuck on a text analysis operation such as clustering a few million documents a decade ago, it would have been a difficult situation. It's a different game altogether today, as your toolbox is equipped with a plethora of capable tools. If I had to mention one and just one tool suite, I would opt for Apache Hadoop.

Though we are aware of the story of Hadoop emerging out of a platform for running a distributed crawler (Apache Nutch), the way it has progressed over the past years is astonishing. It has evolved to a state where it can manage thousands of nodes and deal with petabytes of data without worrying about what application you run on top or how you run it. Yes, it is the de facto big data operating system. It has evolved into a prominent ecosystem. Apart from the default filesystem, HDFS, we see a variety of data persistence solutions, each crafted to provide a missing functionality or supersede its precursors. Whether we need to store content in a sequentially accessible file for processing the whole file in a batch (HDFS SequenceFile), need a data store for random, real-time read/write access (HBase), or would like an SQL-like warehouse (Hive), based on application requirements, we have got one!

Just as storage services have grown richer in features, the computing side has moved on too. It is no longer limited to the plain assembly instructions of distributed computing, the simple map-sort-partition-reduce; we have got high-level statements built using these assembly instructions! We have Pig to script the tasks. We have Oozie workflows to connect the stubbornly independent steps. I was amazed when I tried to rebuild an Oozie workflow using Apache Tez's DAG at runtime. It's monsoon season (party time) for data scientists!

Let us see how smart people are riding the wave by harnessing the power of the big yellow elephant to analyze the flood of data. Here is a huge list of organizations which have put Hadoop to work. Some of them have contributed back by fixing issues, adding and perfecting features, and developing better tools. As a result, a lot gets shared across organizations through active participation. This give-and-take is not just between organizations; the computing disciplines are bartering in their own way too. For instance, Machine Learning and Natural Language Processing complement Big Data, and vice versa.

The currency is not the data itself, but the information hidden inside it! What's the use of data if we do not have the luxury of analytics? How effective is analytics if we do not have visualizations we can grasp in a minute or less? As the majority of Internet content is natural text penned by humans, we definitely need natural language processing and machine learning to get the insights. On the other hand, some of the complex natural language processing problems that demand enormous data for machine learning solutions are now solved more accurately, as we have more data to feed the learning algorithms. As people say, it is the best time.

If you are stuck with the document clustering problem that I mentioned earlier, clustering a twelve-plus-digit number of documents is no longer a difficult situation. You would probably play with the algorithms of Apache Mahout or Apache Spark and run k-means. What if your analysis requires a sequence of tasks such as web crawling, extraction, sentiment analysis and visualization? You are going to form a data pipeline that carries out all the steps in sequence.
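To give a feel for what k-means does, here is a toy, single-machine sketch in plain Python. It is not the Mahout or Spark implementation: the two-dimensional points are made-up stand-ins for document vectors (in practice something like TF-IDF vectors), and the starting centroids are fixed by hand rather than chosen randomly.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: recompute each centroid (keep it if its cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up "document vectors": two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The distributed frameworks run the same two steps, but spread the assignment step across the nodes of the cluster.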

In essence, pipeline processing has the potential to tackle complicated tasks at Internet scale. We have built a pipeline processing platform that assembles components (pieces of software that each solve a simple task) into useful applications and runs these applications on a cluster of nodes. For instance, if you wish to mine what users on the Internet are saying about specific brands, you can:
  • Gather data from the World Wide Web: grab a crawler component and configure the sources.
  • Grab an extractor and connect it to the crawler. Specify all the fields you wish to extract.
  • Grab a Sentiment Analysis module and connect it to the extractor.
  • Optionally, grab and connect an aggregator module to aggregate the sentiments.
There you go, a pipeline will be ready to process data from websites. Visit Datoin and build your first pipeline application.
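The idea of chaining components can be sketched in plain Python. This is not the Datoin API: every function name here is hypothetical, the "crawler" reads from an in-memory dictionary instead of the web, and the sentiment rule is a one-word toy. The point is only how independent components connect into a pipeline.

```python
from collections import Counter


def crawl(sources):
    """Stand-in crawler: yields one page per configured source."""
    for url, text in sources.items():
        yield {"url": url, "text": text}


def extract(pages, brands=("Acme", "Globex")):
    """Stand-in extractor: emits one record per brand mentioned on a page."""
    for page in pages:
        for brand in brands:
            if brand in page["text"]:
                yield {"brand": brand, "text": page["text"]}


def add_sentiment(records):
    """Toy sentiment module: 'awful' means negative, anything else positive."""
    for record in records:
        label = "negative" if "awful" in record["text"] else "positive"
        yield {**record, "sentiment": label}


def aggregate(records):
    """Aggregator: counts records per (brand, sentiment) pair."""
    return Counter((r["brand"], r["sentiment"]) for r in records)


def run_pipeline(data, *stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


# Made-up pages standing in for crawled websites.
pages = {
    "http://example.com/a": "Acme is great",
    "http://example.com/b": "Acme is awful",
    "http://example.com/c": "Globex is great",
}
summary = run_pipeline(pages, crawl, extract, add_sentiment, aggregate)
print(summary)
```

Because each stage only consumes the previous stage's output, a component can be swapped or dropped (the aggregator is optional) without touching the others.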

Thursday, June 19, 2014

How to set up Cloudera Manager 5.x and Hadoop on Ubuntu 14.04 LTS servers?

When you are told to set up a tool, you will usually choose the latest stable version. With this motto, when we were recently setting up our new cluster, we chose Ubuntu 14.04 LTS servers for our infrastructure.

Setting up Hadoop and monitoring it turns out to be cumbersome for new users. Cloudera does a great service by making it simpler.

Faced with the lack of official Cloudera support for Ubuntu 14.04, we hacked a few steps and got it working.

This post describes:
 * Configuring hosts and name resolution
 * Setting up Cloudera Manager on Ubuntu 14.04, and the hacks to make it work
 * Configuring a MySQL database for the Cloudera daemons’ bookkeeping
 * Deploying a 2-node Hadoop cluster with Cloudera Manager

Before you begin:
 * emacs is used for editing the configurations.
    To install emacs :
       sudo apt-get update && sudo apt-get install emacs24-nox
    => use vi or vim if you're not familiar with emacs
 * ssh is used throughout this guide to log in to the hosts; login credentials are assumed.

1. Set Host names :
Configuring hostnames properly is essential.
Let's call the two hosts node2 and node3, the names used in the rest of this guide.

  sudo emacs /etc/hostname  # set names appropriately and save it
 sudo hostname `cat /etc/hostname`  # update hostname on the fly

  Example :  #node2

2. Network : assign static address
Configure a static IP for every host so that the IPs remain the same over time.

  sudo emacs /etc/network/interfaces

  Example :
auto em2 # em2 is interface name, it can be eth0, eth1..
iface em2 inet static
address   # address for each host should be different
dns-nameservers   # google DNS
up ethtool -s em2 wol g           # if you want to enable wake on lan

3. Configure name resolution between the hosts
We are about to form a cluster, so we need to introduce the hosts in the cluster to each other. Since we are not using a local name server, let’s update the hosts entries on each node.

 sudo emacs /etc/hosts  # add an entry for each host

Example >> node2 node3

Note :
  * there must be one ` localhost` entry at the top
  * Tip : do not map 127.0.x.x to hostname, instead use assigned static IP.
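The tip above can be checked with a few lines of Python. This is a small sketch, not part of the Cloudera tooling: the function name is hypothetical, and the sample file content (including the 192.168.1.x addresses) is made up for illustration. It reports cluster hostnames that are mapped to a loopback address such as 127.0.1.1.

```python
import ipaddress


def loopback_mapped_hosts(hosts_text, cluster_hosts):
    """Return cluster hostnames that a hosts file maps to a loopback IP."""
    offenders = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        if ipaddress.ip_address(ip).is_loopback:
            offenders.extend(name for name in names if name in cluster_hosts)
    return offenders


# Made-up hosts file content; the static IPs are placeholders.
sample = """
127.0.0.1 localhost
127.0.1.1 node2        # this line should be removed
192.168.1.12 node2
192.168.1.13 node3
"""
bad = loopback_mapped_hosts(sample, {"node2", "node3"})
print(bad)
```

To check a real host, read the file with `open("/etc/hosts").read()` and pass your own set of hostnames.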

4. Decide Roles for machines based on your cluster setup

Mapping of roles to hosts used in this guide.
   node2 : name-node, job-tracker/resource-manager; data-node, task-tracker, oozie, hue
   node3 : cloudera-manager, data-node, task-tracker, cloudera-agents,  secondary-name-node, zookeeper, mysql

   First we install and configure Cloudera Manager; everything else is then just a few mouse clicks in the manager!

5. Setup MySQL :
  For a production cluster, we use a MySQL database to keep records.
  If you decide to use the embedded PostgreSQL for experimentation, skip this step. If you are planning to use the cluster for serious work with a large cluster size, MySQL may be a good option.

  5.1 : Install MySQL server on one of the nodes
     sudo apt-get install mysql-server   # in this example, on node3
     # Provide root user password and make sure to remember it!

  5.2 : Create databases and give access
       Launch Mysql Console
        mysql -u root -p

# Cloudera manager db user, database and grant
    create user 'cmf'@'%' identified by 'xyz';
    create database cmf;
    grant all privileges on cmf.* to 'cmf'@'%' identified by 'xyz';

    # For activity monitor
    create user 'amon'@'%' identified by 'xyz';
    create database amon;
    grant all privileges on amon.* to 'amon'@'%' identified by 'xyz';

    # Hive Meta store
    create user 'hive'@'%' identified by 'xyz';
    create database metastore;
    grant all privileges on metastore.* to 'hive'@'%' identified by 'xyz';

# Flush all changes and exit from the mysql console
    flush privileges;
    exit

   Write down the user names, databases, and passwords somewhere.
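Since the three blocks in step 5.2 follow the same pattern, the statements can also be generated instead of typed by hand. The helper below is a hypothetical convenience, not something Cloudera ships; it simply reproduces the statement forms used above (which match the MySQL version of that era) for a list of user and database names. The password 'xyz' is the same placeholder used in this guide.

```python
def cloudera_db_statements(accounts, password):
    """Build the create/grant statements of step 5.2 for (user, database) pairs."""
    statements = []
    for user, database in accounts:
        statements.append(
            f"create user '{user}'@'%' identified by '{password}';"
        )
        statements.append(f"create database {database};")
        statements.append(
            f"grant all privileges on {database}.* "
            f"to '{user}'@'%' identified by '{password}';"
        )
    statements.append("flush privileges;")
    return statements


# The same users and databases as in this guide.
accounts = [("cmf", "cmf"), ("amon", "amon"), ("hive", "metastore")]
statements = cloudera_db_statements(accounts, "xyz")
print("\n".join(statements))
```

The printed statements can be pasted into the mysql console opened in step 5.2.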

  5.3 Tweak Mysql Server settings

      # backup settings
      sudo cp /etc/mysql/my.cnf /etc/mysql/my.cnf.bak

      # stop Mysql before editing any config,
      sudo service mysql stop
      # remember to start it afterwards with 'sudo service mysql start'

     5.3.1 Make Mysql Server accessible from other hosts :
           set bind address to ''(all interfaces) instead of '' (loopback)
           sudo emacs /etc/mysql/my.cnf
               >>  find line having 'bind-address ='
                      and update to 'bind-address ='

     5.3.2  Tune MySQL settings for Cloudera
Configure InnoDB as specified here

# Under mysqld block
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M

# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit  = 2
innodb_log_buffer_size          = 64M
innodb_buffer_pool_size         = 4G
innodb_thread_concurrency       = 8
innodb_flush_method             = O_DIRECT
innodb_log_file_size = 512M

     5.3.3 Restart Mysqld
        sudo service mysql start

   5.4 Install Mysql Java Connector
       This should be installed on the hosts that need a
       connection to the MySQL server (such as the hosts running Cloudera Manager or the Hive server).
       Install it on all hosts if you are not sure:
         sudo apt-get install libmysql-java

6 Install Cloudera Manager.
   You need one host to act as the manager for your cluster.
This host need not be part of the Hadoop cluster. Read more at

    6.1. Fake ubuntu 14.04 as 12.04
         Cloudera has no support for 14.04 yet (as of June 2014)!
         So we are going to install the 12.04 packages on 14.04, with a little trouble!
         Once Cloudera supports 14.04 by hosting repositories and builds, this step may no longer be required.

     ON ALL UBUNTU 14.04 HOSTS :
     6.1.1 Take Backup of release version file
              sudo cp /etc/lsb-release /etc/lsb-release.bak
 # remember to revert back after setup is complete
6.1.2 Modify release file with ubuntu 12.04 LTS contents
              sudo emacs /etc/lsb-release
               # edit it with following content, from 12.04 server
        DISTRIB_DESCRIPTION="Ubuntu 12.04.3 LTS"

  6.2 Add cloudera repositories to fetch packages, add key
        sudo curl "" -o /etc/apt/sources.list.d/cloudera_precise.list
        curl -s | sudo apt-key add -

  6.3 Install Cloudera manager and start it
sudo apt-get update
sudo apt-get install cloudera-manager-daemons \

  6.4 Configure DB settings
         sudo emacs /etc/cloudera-scm-server/
         # update the host, db, user and password, which were created earlier in step 5.2.
         In our example,
    com.cloudera.cmf.db.type=mysql  # host name
com.cloudera.cmf.db.password=xyz  # actual password

 6.5  Start Cloudera manager
  sudo service cloudera-scm-server start

In browser, open http://<manager_host>:7180/
If you see the login form, then you are good.
  If something is wrong, inspect the log with `sudo tailf -100 /var/log/cloudera-scm-server/cloudera-scm-server.log`
   Common errors :
            > Driver class not found : MySQL JDBC connector library is not installed! => step 5.4
            > Can't connect          : DB settings are invalid => step 6.4

6.6 More Fixes/hacks for Ubuntu 14.04

   6.6.1 update-alternatives has moved from /usr/sbin/ to /usr/bin/,
      but some of the scripts use hard-coded paths.
     sudo ln -s /usr/bin/update-alternatives \
   6.6.2 Agents require `ntpdc`, which is absent. Install it:
     sudo apt-get install ntp
   6.6.3 Agents require 'fuse-utils', but it is missing from the 14.04 apt repos.
       You need to find it manually and install it in advance! Get it from the Debian repo:
   sudo  dpkg -i fuse-utils_2.9.0-2+deb7u1_all.deb
7. Continue Installation through web interface 

Open http://<manager_host>:7180/ in a browser and proceed with the installation wizard
   7.1 Specify hosts.
        >> node[2-3]
        or >>[1-2]
        Select hosts and proceed
   7.2 Use Recommended Installation :
        Use Parcels and customize services before installation
   7.3 Now Manager needs access to all the hosts
       If you already have the same user account on all hosts (for example, EC2 instances have the ‘ubuntu’ account by default), then skip the subsequent user account creation step. Otherwise, create a new user and provide it to Cloudera. It requires an account with passwordless sudo to install services and to start and stop them.
  7.3.1 Create user on all hosts
  sudo adduser cdhusr

 7.3.2 Give cdhusr passwordless sudo on all hosts
    sudo visudo
         add this line  under entry for root
         save it
to know more, visit

 7.3.3 Authorize cdhusr with an ssh key or a password.
          Let's go with the password option: provide the password and resume the installation.
  7.3.4 If you see red at the end of the installation due to no heartbeat from the Cloudera agents, don't worry, it's a known issue.

  This happens because the CDH5 agents use an older version of Python on 14.04.
  Find the Python binary in the agent's virtualenv and replace it with the newer version of Python:

# go to virtual env
cd /usr/lib/cmf/agent/build/env/
# take backup
sudo mv  bin/python bin/python.bak
# use the newer version from ubuntu 14.04
sudo cp /usr/bin/python2.7 bin/python
      Retry Installation. If you see green, manager installation is complete.

8. Deploy cluster
           Cloudera has a nice, intuitive UI to deploy services and manage them. Remember the role assignment plan from step 4, and make yourself familiar with “Add Cluster” and “Add Service”. This step is not described here as it is mostly performed in the web app UI.

9. Post Installation.
Revert the release version to 14.04.
sudo mv /etc/lsb-release /etc/lsb-release.precise
sudo mv /etc/lsb-release.bak /etc/lsb-release

If you find anything missing or face any issues, please use the comment box to ask.