What is Apache Hive?

Perform SQL-like queries on petabytes of data

What is Hive?

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at massive scale. A data warehouse provides a central store of information that can easily be analyzed to make informed, data-driven decisions. Hive allows users to read, write, and manage petabytes of data using SQL.

Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez or MapReduce, with a SQL-like interface.


How does Hive work?

Hive was created to allow non-programmers familiar with SQL to work with petabytes of data through a SQL-like interface called HiveQL. Traditional relational databases are designed for interactive queries on small to medium datasets and do not handle huge datasets well. Hive instead uses batch processing, so it can work quickly across very large distributed datasets. Hive transforms HiveQL queries into MapReduce or Tez jobs that run on Apache Hadoop’s distributed job-scheduling framework, Yet Another Resource Negotiator (YARN). It queries data stored in a distributed storage solution, such as the Hadoop Distributed File System (HDFS) or Amazon S3. Hive stores its database and table metadata in a metastore, which is a database or file-backed store that enables easy data abstraction and discovery.
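The flow above can be sketched in HiveQL. As an illustrative example (the bucket name, table, and columns are assumptions, not from any real deployment), an analyst might define an external table over log files already sitting in Amazon S3 and run an aggregate query, which Hive compiles into Tez or MapReduce jobs scheduled by YARN:

```sql
-- Illustrative only: bucket name, schema, and columns are assumptions.
-- Point Hive at existing files in S3 without moving them (schema-on-read);
-- only the table metadata is recorded in the metastore.
CREATE EXTERNAL TABLE page_views (
  user_id   INT,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://example-bucket/logs/page_views/';

-- Choose the execution engine (Tez here); Hive compiles the query
-- into a graph of tasks that YARN schedules across the cluster.
SET hive.execution.engine=tez;

SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

Note that only the schema and location land in the metastore; the data itself never leaves S3.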

Hive includes HCatalog, a table and storage management layer that reads data from the Hive metastore to facilitate seamless integration between Hive, Apache Pig, and MapReduce. By using the metastore, HCatalog allows Pig and MapReduce to use the same data structures as Hive, so the metadata doesn’t have to be redefined for each engine. Custom applications and third-party integrations can use WebHCat, a RESTful API for HCatalog, to access and reuse Hive metadata.
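As a sketch of what the WebHCat surface looks like (the host, table, and user name here are placeholders; WebHCat listens on port 50111 by default and roots its API at /templeton/v1), a client could fetch a table’s metadata from HCatalog over REST:

```shell
# Placeholder host, table, and user; requires a running WebHCat server.
curl -s "http://webhcat-host:50111/templeton/v1/ddl/database/default/table/page_views?user.name=analyst"
```

The JSON response describes the table’s columns and storage, letting an external application reuse Hive’s metadata without redefining it.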

Benefits of Hive

Fast

Hive is designed to quickly handle petabytes of data using batch processing.


Familiar

Hive provides a familiar, SQL-like interface that is accessible to non-programmers.


Scalable

Hive is easy to distribute and scale based on your needs.


Apache Hive vs Apache HBase

Apache HBase is a NoSQL distributed database that enables random, strictly consistent, real-time access to petabytes of data. Apache Hive is a distributed data warehouse system that provides SQL-like querying capabilities.

| Characteristics | Apache Hive | Apache HBase |
| --- | --- | --- |
| Function | SQL-like query engine designed for high-volume data stores. Multiple file formats are supported. | Low-latency distributed key-value store with custom query capabilities. Data is stored in a column-oriented format. |
| Processing Type | Batch processing using the Apache Tez or MapReduce compute frameworks. | Real-time processing. |
| Latency | Medium to high, depending on the responsiveness of the compute engine. The distributed execution model provides superior performance compared to monolithic query systems, such as an RDBMS, for the same data volumes. | Low, but it can be inconsistent. Structural limitations of the HBase architecture can result in latency spikes under intense write loads. |
| Hadoop Integration | Runs on top of Hadoop, with Apache Tez or MapReduce for processing and HDFS or Amazon S3 for storage. | Runs on top of HDFS or Amazon S3. |
| SQL Support | Provides SQL-like querying capabilities with HiveQL. | No SQL support on its own; Apache Phoenix can be layered on top for SQL capabilities. |
| Schema | Defined schema for all tables. | Schema-free. |
| Data Types | Supports structured and unstructured data. Provides native support for common SQL data types, like INT, FLOAT, and VARCHAR. | Supports unstructured data only. The user defines mappings of data fields to Java-supported data types. |
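The schema and data-type contrast above can be made concrete: the two systems also interoperate, since Hive ships a storage handler that maps untyped HBase cells onto typed Hive columns. A minimal sketch (the table and column names are illustrative, not from the source):

```sql
-- Illustrative sketch: expose an existing HBase table to Hive via the
-- HBase storage handler, mapping untyped HBase cells (row key plus
-- 'info' column family) onto typed Hive columns.
CREATE EXTERNAL TABLE hbase_customers (
  id      INT,
  name    VARCHAR(64),
  balance FLOAT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:balance')
TBLPROPERTIES ('hbase.table.name' = 'customers');
```

With a mapping like this, HiveQL queries can read HBase data in batch while HBase continues to serve real-time lookups.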

Hive Use Cases

Airbnb connects people with places to stay and things to do around the world, with 2.9 million hosts listed and 800,000 nightly stays supported. Airbnb uses Amazon EMR to run Apache Hive on an S3 data lake. Running Hive on the EMR clusters enables Airbnb analysts to perform ad hoc SQL queries on data stored in the S3 data lake. By migrating to an S3 data lake, Airbnb reduced expenses, gained the ability to attribute costs, and tripled the speed of its Apache Spark jobs.

Guardian gives 27 million members the security they deserve through insurance and wealth management products and services. Guardian uses Amazon EMR to run Apache Hive on an S3 data lake, where Hive handles batch processing. The S3 data lake fuels Guardian Direct, a digital platform that allows consumers to research and purchase both Guardian products and third-party products in the insurance sector.

FINRA – the Financial Industry Regulatory Authority – is the largest independent securities regulator in the United States, and monitors and regulates financial trading practices. FINRA uses Amazon EMR to run Apache Hive on an S3 data lake. Running Hive on the EMR clusters enables FINRA to process and analyze trade data of up to 90 billion events using SQL. The cloud data lake delivered cost savings of up to $20 million compared to FINRA’s on-premises solution and drastically reduced the time needed for recovery and upgrades.

Vanguard, an American registered investment advisor, is the largest provider of mutual funds and the second-largest provider of exchange-traded funds. Vanguard uses Amazon EMR to run Apache Hive on an S3 data lake. Data is stored in S3, and EMR builds a Hive metastore on top of that data. The Hive metastore contains all the metadata about the data and tables in the EMR cluster, which allows for easy data analysis. Hive also enables analysts to perform ad hoc SQL queries on data stored in the S3 data lake. Migrating to an S3 data lake with Amazon EMR has enabled 150+ data analysts to realize operational efficiency and has reduced EC2 and EMR costs by $600k.

Hive and Hadoop on AWS

Amazon EMR provides the easiest, fastest, and most cost-effective managed Hadoop framework, enabling customers to process vast amounts of data across dynamically scalable EC2 instances. Customers can also run other popular distributed frameworks such as Apache Hive, Spark, HBase, Presto, and Flink on EMR. Learn more about Amazon EMR.


Get started with Hive on Amazon EMR

Sign up for an AWS account

Instantly get access to the AWS Free Tier.

Learn more about Hive on Amazon EMR

Read more about Hive on EMR here.

Get started with Hive on EMR

Upload your data on Amazon S3 and get started with Amazon EMR here.

Have a POC and want to talk to someone? Contact us.
