Apache Hadoop has rapidly become a leading open source Big Data management software used for all the situations where large scopes of data have to be dealt with. But there are some scenarios when Hadoop is especially useful.
Apache Hadoop has rapidly become a leading open source Big Data management software used for situations where large scopes of data must be dealt with. But there are some scenarios when Hadoop is especially useful:
Complex information processing is needed (But this applies to parallelizable algorithms only.)
Unstructured data processing is needed
It is not critical to have results in real time
Machine learning tasks are involved
Data sets are too large to fit into the RAM or discs; or require too many cores to process them (few TBs and more).
Let`s consider some typical Hadoop usage examples.
Log Files/Click Stream Analysis
Log processing is one of the most common tasks solved by Hadoop. Large data centers can generate gigabytes of logs every single day. These logs may be extremely useful for troubleshooting and analysis, but storage and processing of such volumes of data presents a serious technical challenge. Even a simple search becomes an issue when dealing with such extensive volumes of data. A single machine can't perform such a job effectively, which leads to the necessity of having a cluster of distributed machines to work on the task. This is where Hadoop becomes useful as it provides tools to store (HDFS) and process (MapReduce) large volumes of log files.
Recommendation Engines/Ad Targeting
In order to suggest the most appropriate ads for a given user, it`s necessary to analyze all the available information about the users (profile, web browsing history, clicks history, email headers to, etc.). The recommendations engine should understand behavior and preferences of each individual user and be able to estimate the probability of the user’s interest in each specific ad.
Another type of similar analysis could be described by use cases when an organization will process all the available data from multiple data sources related to each customer in order to understand: what could be done (campaigns, benefit programs, personal treatment, etc.) to increase customer satisfaction?
Such tasks can be effectively solved by Hadoop as it allows for analyzing different parts of data (e.g. data about each customer) separately and in parallel.
Search Index Generation
This is where Hadoop was originally started as an attempt to build an open source search engine. After Google published their GFS (Google's distributed File System) and MapReduce papers, Hadoop was designed and implemented according to the design principles defined in these papers.
The problem was to effectively store and process extremely large volumes of crawled data. The main result of this processing is a generated search index. Each individual piece of data (e.g. crawled web site) could be processed separately from other parts of data. But there was a need in the infrastructure and programming model to store and process huge volumes of data, and this is where HDFS/MapReduce were designed to complete the task.
Hadoop as a scalable, fault-tolerant and distributed data storage and processing ecosystem consists of different components, each designed for a particular purpose. In this post I`ll address four of them, namely HDFS, MapReduce, HBase, and Hive; and I will also discuss what types of tasks each of these component is best suited for.
HDFS is specifically tuned to support large files, gigabytes to terabytes in size. Thus, if you need to store an extra large number of smaller files (small images, for instance), HDFS should not be your first (or best) choice. This task of storing large number of small files can be much more effectively performed using HBase.
MapReduce framework was primarily designed to allow parallel processing of large data sets. It is best suited for the scenarios when a data set can be split into separate subsets processed and analyzed independently from each other. The results of such individual data subsets processing are then aggregated into a general result.
MapReduce is not effective enough for the scenarios demanding the ability to analyze a specific subset of data processing, where the algorithm should communicate closely with other computing nodes which are processing other subsets of data. For such cases other cloud computing infrastructures are more appropriate.
HBase is a "NoSQL" column-oriented data store, so it may be the best choice for use cases where "NoSQL" data stores are necessary: namely, when scalability and flexible schema is most important, and there are no strict requirements for ACID guarantees and SQL support.
HBase is integrated with Hadoop MapReduce, meaning that HBase can serve as a data source and target data storage for MapReduce jobs. This makes HBase a good candidate for the tasks where the power of NoSQL data stores needs to be combined with MapReduce.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. It also allows the usage of traditional MapReduce programs to be plugged in.
But Hive is based on MapReduce which is not generally suitable for low-latency requests. Hive parses HiveQL query, creates map/reduce application based on parsing results, and executes it. So it can't be considered as a replacement for traditional RDBM systems. SQL-like language is used for convenience mostly, but not as an RDBMS replacement.
The possibilities that Hadoop provides to your business may be immense, but the effectiveness and real value that you receive will largely depend on whether its many components are correctly applied to the tasks they are best suited for.
By Serhiy Verovka Senior Solutions Architect
31 Jul 2012