Hive (SQL-style) Query Language

Treasure Data is an analytics infrastructure as a service. We provide a SQL-like query language interface called the Hive query language. This article covers the basics of the Hive query language.

Table of Contents

Get the extended Hive Guide as PDF!

Download Your Hive Guide

About Apache Hive

The Hive query language (HiveQL) is the primary data processing method for Treasure Data. HiveQL is powered by Apache Hive. Treasure Data is a cloud data platform that allows users to collect, store, and analyze their data on the cloud. Treasure Data manages its own Hadoop cluster, which accepts queries from users and executes them using the Hadoop MapReduce framework. HiveQL is one of the languages it supports.

Hive Example Query Catalog

If you’re looking for dozens of HiveQL templates, please visit Treasure Data’s example query catalog page.

SELECT Statement Syntax

Here’s the syntax of Hive’s SELECT statement.

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
;

SELECT is the projection operator in SQL. The points are:

  • SELECT scans the table specified by the FROM clause
  • WHERE gives the condition of what to filter
  • GROUP BY gives a list of columns which specify how to aggregate the records
  • CLUSTER BY, DISTRIBUTE BY, SORT BY specify the sort order and algorithm
  • LIMIT specifies how many # of records to retrieve

Computing with Columns

When you select the columns, you can manipulate column values using either arithmetic operators or function calls. Math, date, and string functions are popular.

Here’s an example query that uses both operators and functions.

SELECT upper(name), sales_cost FROM products;

WHERE Clauses

A WHERE clause is used to filter the result set by using predicate operators and logical operators. Functions can also be used to compute the condition.

Here’s an example query that uses a WHERE clause.

SELECT name FROM products WHERE name = 'stone of jordan';

GROUP BY Clauses

A GROUP BY clause is frequently used with aggregate functions, to group the result set by columns and apply aggregate functions over each group. Functions can also be used to compute the grouping key.

Here’s an example query that groups and counts by category.

SELECT category, count(1) FROM products GROUP BY category;

HAVING Clauses

A HAVING clause lets you filter the groups produced by GROUP BY, by applying predicate operators to each groups.

Here’s an example query that groups and counts by category, and then retrieves only counts > 10;

SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10;

Example Queries

Here are some basic examples. The underlying table consists of three fields: ip, url, and time.

# Number of Records
SELECT COUNT(1) FROM www_access;

# Number of Unique IPs
SELECT COUNT(1) FROM ( \
  SELECT DISTINCT ip FROM www_access \
) t;

# Number of Unique IPs that Accessed the Top Page
SELECT COUNT(distinct ip) FROM www_access \
  WHERE url='/';

# Number of Accesses per Unique IP
SELECT ip, COUNT(1) FROM www_access \
  GROUP BY ip LIMIT 30;

# Unique IPs Sorted by Number of Accesses
SELECT ip, COUNT(1) AS cnt FROM www_access \
  GROUP BY ip
  ORDER BY cnt DESC LIMIT 30;

# Number of Accesses After a Certain Time
SELECT COUNT(1) FROM www_access \
  WHERE TD_TIME_RANGE(time, "2011-08-19", NULL, "PDT")

# Number of Accesses Each Day
SELECT \
  TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT") AS day, \
  COUNT(1) AS cnt \
FROM www_access \
GROUP BY TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT")

TD_TIME_RANGE UDF is simple and efficient to use. Please refer to the Performance Tuning article for more information.

Untitled-3
The `time` column is a special column that is always present and stores the UNIX timestamp of the log.

INSERT Statement Syntax

Here’s the syntax of Hive’s INSERT statement.

INSERT OVERWRITE TABLE tablename1 select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 select_statement1 FROM from_statement;
  • INSERT OVERWRITE will overwrite any existing data in the table.
  • INSERT INTO will append to the table, keeping the existing data intact.
  • If record doesn’t include time column, time column is imported TD_SCHEDULED_TIME().
  • If record includes time column, time column should be Unixtime.
Untitled-3
Query doesn't allow multiple inserts. NOTE: All INSERT INTO statements require FROM clause.

Hive Syntax Checker

If you would like to experiment with the Hive SQL syntax, we made our Hive Syntax linter publicly available here.

See Other Documents

Here is a list of documents regarding the HiveQL.

Here are the official Hive tutorial and language manual.


Last modified: Feb 22 2017 23:40:42 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.