Thursday, April 30, 2015

A data pipeline for the Internet era

Hold your breath, count three, two, one, and release. Congratulations! You just spent three seconds of your life. Guess what happened on the Internet in the meantime?
  • How many new tweets were posted on Twitter?
  • How many posts were published on Facebook and Google+?
  • How many new videos were uploaded to YouTube?
  • How many new images were uploaded to Instagram?
  • How many articles were posted on WordPress and Blogger?
  • How many questions were asked on Quora and the StackExchange sites, and how many answers were posted to those questions?

While the above questions may seem rhetorical, trying to find approximate numbers gives a sense of how fast the online world moves. New users keep joining the Empire of Internet as we move towards connecting every person and thing on this planet to the Internet. Entrepreneurs have given us the freedom to express and share our opinions online. New kinds of devices keep being invented, always ready to consume our content. I no longer need to walk over to a big fat machine, wait minutes for it to turn on, and connect to the Internet to write and share my opinions; today's smartphones and tablets are always on, always connected, eagerly waiting, hungry to consume what I generate.

Consequently, the volume of information is proliferating! The other part of the story: how do we deal with such a huge volume of information? As the old proverb says, "Necessity is the mother of invention", and the necessity of dealing with huge volumes of information led to the creation of amazing frameworks. Thanks go to the engineers who made up their minds to decipher the hidden hints and bring us solutions. Not every problem is solved, at least not yet, but some good souls have open-sourced their creations so that anyone can explore and improve on them. As a result, a lot has changed in the past decade (2005 - 2015). If you had been stuck on a text analysis task such as clustering a few million documents a decade ago, it would have been a difficult situation. It's a different game altogether today, as your toolbox is equipped with a plethora of capable tools. If I had to mention one and just one tool suite, I would opt for Apache Hadoop.

Though we are aware of the story of Hadoop emerging out of the platform for running a distributed crawler (Apache Nutch), the way it has progressed over the past years is astonishing. It has evolved to a state where it can manage thousands of nodes and deal with petabytes of data, regardless of what application you run on top and how you run it. Yes, it is the de facto big data operating system. It has evolved into a prominent ecosystem. Apart from the default filesystem, HDFS, we see a variety of data persistence solutions, each crafted to provide a missing functionality or supersede its precursors. Whether we need to store content in a sequentially accessible file for processing the whole file in a batch (a SequenceFile on HDFS), a data store for random, real-time read/write access (HBase), or a SQL-like warehouse (Hive), depending on the application's requirements, we have got one!

Just as storage services grew richer in features, the computing part moved on too. It is no longer limited to the plain assembly instructions of distributed computing, the simple map-sort-partition-reduce; we now have high-level statements built from these assembly instructions! We have Pig to script the tasks. We have Oozie workflows to connect the stubbornly independent steps. I was amazed when I tried to rebuild an Oozie workflow using Apache Tez's DAG at runtime. It's monsoon season (party time) for data scientists!
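To make those "assembly instructions" concrete, here is a minimal single-process sketch of the map, partition, sort and reduce steps, using word counting as the task. It is plain Python, not Hadoop's API; the function names (map_phase, partition, reduce_phase, word_count) and the toy hash are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def partition(pairs, num_reducers):
    """Route each pair to a bucket by key, so equal keys share a bucket."""
    buckets = defaultdict(list)
    for key, value in pairs:
        # Toy deterministic hash (an assumption): sum of character codes.
        bucket_id = sum(map(ord, key)) % num_reducers
        buckets[bucket_id].append((key, value))
    return buckets


def reduce_phase(bucket):
    """Sort one bucket by key, then sum the values of each run of equal keys."""
    ordered = sorted(bucket, key=lambda pair: pair[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=lambda pair: pair[0])
    }


def word_count(documents, num_reducers=2):
    """Run map -> partition -> sort -> reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    counts = {}
    for bucket in partition(pairs, num_reducers).values():
        counts.update(reduce_phase(bucket))
    return counts


if __name__ == "__main__":
    docs = ["big data big pipeline", "data pipeline for the Internet era"]
    print(word_count(docs))
```

On a real cluster each bucket would go to a different reducer node; tools such as Pig let you express the same job in a few high-level statements instead of writing these steps by hand.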

Let us see how smart people are riding the wave by harnessing the power of the big yellow elephant to analyze the deluge of data. Here is a huge list of organizations which have put Hadoop to work. Some of them have contributed back by fixing issues, adding and perfecting features, and developing better tools. As a result, a lot has been shared across organizations through active participation. This give-and-take is not limited to organizations; the computing disciplines are bartering in their own way too. For instance, machine learning and natural language processing complement big data, and vice versa.

The currency is not the data itself, but the information hidden inside it! What's the use of data if we do not have the luxury of analytics? How effective is analytics if we do not have visualizations we can grasp in a minute or less? As the majority of Internet content is natural text penned by humans, we definitely need natural language processing and machine learning to get the insights. On the other hand, some complex natural language processing problems that demand enormous data for machine learning solutions are now solved more accurately, because we have more data to feed the learning algorithms. As people say, it is the best time.

If you are stuck with the document clustering problem that I mentioned earlier, it is no longer a difficult situation, even with a twelve-plus-digit number of documents. You would probably play with the algorithms of Apache Mahout or Apache Spark and run k-means. What if your analysis requires a sequence of tasks such as web crawling, extraction, sentiment analysis and visualization? You are going to form a data pipeline that carries out all the steps in sequence.

In essence, pipeline processing has the potential to tackle complicated tasks at Internet scale. We have built a pipeline processing platform for assembling components (pieces of software that each solve a simple task) into useful applications and running these applications on a cluster of nodes. For instance, if you wish to mine what users on the Internet are saying about specific brands, you can:
  • Gather data from the World Wide Web: just grab a crawler component and configure the sources.
  • Grab an extractor and connect it to the crawler. Specify all the fields which you wish to extract.
  • Grab a sentiment analysis module and connect it to the extractor.
  • Optionally, grab an aggregator module and connect it to aggregate the sentiments.
There you go: the pipeline is ready to process data from websites. Visit Datoin and build your first pipeline application.
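The steps above can be sketched in a few lines of plain Python. This is only an illustration of chaining components, not Datoin's actual API: every name here (build_pipeline, crawler, extractor, sentiment, aggregator, the sample pages, the brand "AcmePhone" and the tiny word lists) is hypothetical, and the crawler returns in-memory sample pages instead of fetching anything.

```python
import re
from collections import Counter

# Hypothetical stand-in for crawled pages; a real crawler would fetch
# the configured sources from the web.
SAMPLE_PAGES = [
    {"url": "http://example.com/1", "html_title": "Review",
     "text": "I love the new AcmePhone, great camera"},
    {"url": "http://example.com/2", "html_title": "Rant",
     "text": "AcmePhone battery is terrible"},
    {"url": "http://example.com/3", "html_title": "Note",
     "text": "Bought an AcmePhone yesterday"},
]

# Toy sentiment lexicon (an assumption, far too small for real use).
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"terrible", "bad", "hate"}


def crawler(_):
    """Source component: ignores its input and returns the raw pages."""
    return list(SAMPLE_PAGES)


def extractor(pages):
    """Keep only the fields we asked for: url and text."""
    return [{"url": page["url"], "text": page["text"]} for page in pages]


def sentiment(records):
    """Label each record by counting lexicon words in its text."""
    labeled = []
    for record in records:
        words = re.findall(r"[a-z]+", record["text"].lower())
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        labeled.append({**record, "sentiment": label})
    return labeled


def aggregator(records):
    """Optional last step: count how many records carry each label."""
    return dict(Counter(record["sentiment"] for record in records))


def build_pipeline(*components):
    """Connect components so each one consumes the previous one's output."""
    def run(data=None):
        for component in components:
            data = component(data)
        return data
    return run


brand_pipeline = build_pipeline(crawler, extractor, sentiment, aggregator)

if __name__ == "__main__":
    print(brand_pipeline())
```

Because every component takes one input and returns one output, dropping the optional aggregator or swapping in a different extractor only changes the argument list of build_pipeline.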
