July 22, 2009

Open source Hive: Large-scale, distributed data processing made easy

Thank heaven for Hive, a data analysis and query front end for Hadoop that makes Hadoop data files look like SQL tables

Bottom Line
Apache Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduced tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS.

Suppose you want to run regular statistical analyses on your Web site's traffic log data -- several hundred terabytes, updated weekly. (Don't laugh. This is not unheard of for popular Web sites.) You're already familiar with Hadoop (see InfoWorld's review), the open source distributed processing system that would be ideal for this task. But you don't have time to code Hadoop map/reduce functions? Perhaps you're not the elite programmer that everyone in the office thinks you are.

What you'd like to do is dump all that information into a database, and execute a set of SQL queries on it. But the quantity of data would overwhelm even an enterprise-level RDBMS.

[ Read the InfoWorld Test Center's hands-on account of working with Amazon Elastic MapReduce and Amazon Web Services. | Keep abreast of cloud computing news by visiting InfoWorld's Cloud Computing channel. ]

This is precisely the problem that engineers at Facebook encountered. They became interested in Hadoop as a means of processing their Web site's traffic data that was generating terabytes per day, was growing, and was overtaxing their Oracle database. Though they were happy with Hadoop, they wanted to simplify its use so that engineers could express frequently used analysis operations in SQL. The resulting Hadoop-based data warehouse application became Hive, and it helps to process more than 10TB of Facebook data daily. Now Hive is available as an open source subproject of Apache Hadoop.

Inside the Hive
Written in Java, Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduced tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS.

Employing Hadoop's distributed file system (HDFS) as data storage, Hive inherits all of Hadoop's fault tolerance, scalability, and adeptness with huge data sets. When you run Hive, you are deposited into a shell, within which you can execute Hive Data Definition Language (DDL) and HQL commands. A future version of Hive will include JDBC and ODBC drivers, at which time you will be able to create fully executable "Hive applications" in much the same way that you can write a Java database application for your favorite RDBMS. (The current version of Hive -- 0.3.0 -- does have limited support for JDBC, but can only dispatch queries and fetch results.)

To install Hive, you simply install Hadoop and add a couple of download and configuration steps. (To install Hadoop, the best tutorial I've found is on Michael Noll's blog.) Or if you'd rather just get straight to testing Hive without all the installation nonsense, you can download a VMware virtual machine image with Hadoop and Hive pre-installed. The virtual machine image is featured in an excellent Hive tutorial video available at the same Web site.

Although Hive query language (HQL) commands are usually executed from within the Hive shell, you can launch the Hive Web Interface service and run HQL queries from within a browser. You can start multiple queries, and the Web interface will let you monitor the status of each.

Close

On Twitter now

Development platforms

Powered by Twitter

White Paper

D2D Virtual Tape Library Replication Primer

This whitepaper explains the terminology and concepts behind Data Replication technologies and establishes some sizing rules through worked examples. Learn the new paradigm in disaster tolerance—protect data anywhere.

Download now »

White Paper

An Alternative to Virtualization for Datacenter Cost Savings

Server virtualization is a popular option for dealing with mounting datacenter costs. Another equally promising approach is the use of an Application Delivery Controller. Citrix NetScaler provides a low-cost way for organizations to reduce their server count and accrue cost savings from a reduction in space, cooling, power and personnel.

Download now »

White Paper

Why Your Firewall, VPN, and IEEE 802.11i Aren't Enough to Protect Your Network

The emergence of WLANs has created a new breed of security threats to enterprise networks.

Included in HP ProCurve WLAN solutions is security technology that alleviates threats from WLANs through:
* Monitoring wireless activity inside and out of the enterprise
* Classifying WLAN transmissions into harmful and harmless
* Preventing transmissions that pose a security threat to the enterprise network
* Locating participating devices for physical remediation

Download now »

White Paper

Bringing the Edge to the Data Center

Effectively address data protection challenges, implementing solutions that help store and protect business–critical data while cutting costs and improving efficiency and reliability.

Download now »
jsensarma 23-Jul-09 3:56pm
Great article Rick! one minor correction. we load 15TB of new uncompressed data into Hive each day at Facebook. but the amount of data processed exceeds 100TB of compressed data. also - Hive exposes map-reduce programming paradigm directly to users as well. users can specify map and reduce scripts in the language of their choice and chain them using the 'CLUSTER BY' clause. this allows Hive to be used for processing of unstructured (textual) data as well (for example language recognition, sentiment analysis)
Louis 19-Sep-09 11:27pm
if you want to buy replica louis vuitton handbags, you have to visit buy louis vuitton to take a look at the latest louis vuitton store which is also the best louis vuitton wholesaleas I know.

Sign up to receive InfoWorld Resource Alerts

Subscribe to the Cloud Computing Report Newsletter

Receive the latest cloud computing news and find out how to deliver it more effectively.

©1994-2009 Infoworld, Inc.