Hive and Other Hadoop-Based Resource Pools

Treasure Data Resource Pools for Hadoop allow you to divide the resources available for Hive (and for bulk import and td table:export, which also run on Hadoop) into pools reserved for specific workloads. You can then organize the use of those resources across projects, groups, or use cases.

Resource pools are helpful for the following challenges:

  • Ensuring that high-priority jobs receive the resources they need to run within a strict SLA (for example, a batch job that must finish within a specific overnight time window)

  • Always reserving some resources for ad hoc jobs that have to run with the lowest possible latency

  • Limiting the maximum resources, and therefore the cost, of running lower-priority jobs

You can set both the guaranteed minimum and the maximum resources allotted to jobs that run in a specific resource pool.

Setup

This feature is enabled upon request. Treasure Data support can enable this feature and configure your desired resource pools. Contact support or your primary account representative if you want to use Resource Pools.

Hadoop Resources and How Resource Pools Affect Them

Treasure Data’s Hadoop cluster is shared among many customers. Each customer starts with a single queue for all submitted Hadoop jobs, such as Hive queries, and those jobs are processed as resources permit. A number of parallel processes (cores), based on your plan, is dedicated to processing those jobs. The number of cores is determined by your Hadoop Compute Units, as follows:

  • A customer is guaranteed a minimum of 2 cores per compute unit at all times. For example, a customer with 20 Hadoop compute units is guaranteed at least 40 cores at all times.

  • During off-peak periods on Treasure Data’s Hadoop cluster (across all customers), a customer may be granted up to 4x their guaranteed compute cores, to process their jobs faster.
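The arithmetic behind these two rules can be sketched in a few lines of Python. This is only an illustration of the figures stated above (2 guaranteed cores per compute unit, and up to 4x that amount off-peak); the function and constant names are hypothetical and are not part of any Treasure Data API.

```python
# Figures taken from the text above; names are illustrative only.
GUARANTEED_CORES_PER_UNIT = 2   # guaranteed cores per Hadoop compute unit
OFF_PEAK_MULTIPLIER = 4         # off-peak ceiling, as a multiple of the guarantee


def guaranteed_cores(compute_units: int) -> int:
    """Minimum number of cores available to the plan at all times."""
    if compute_units < 0:
        raise ValueError("compute_units must be non-negative")
    return compute_units * GUARANTEED_CORES_PER_UNIT


def off_peak_ceiling(compute_units: int) -> int:
    """Upper bound on cores that may be granted off-peak (never guaranteed)."""
    return guaranteed_cores(compute_units) * OFF_PEAK_MULTIPLIER


print(guaranteed_cores(20))   # 40
print(off_peak_ceiling(20))   # 160
```

The off-peak figure is a ceiling, not a promise: whether a job actually receives those cores depends on the overall load on the shared cluster.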

When you create multiple resource pools, you designate the minimum percentage of resources to devote to each one. For example, you might create three pools:

  • “adhoc”, assigned 25% of plan capacity
  • “batch”, assigned 65% of plan capacity
  • “Best_effort”, assigned 10% of plan capacity

The assigned percentage determines the minimum number of CPU cores guaranteed to the jobs in each pool. For a plan with 40 guaranteed cores (20 compute units):

  • “adhoc” jobs will always get at least 10 (=25% of 40) cores
  • “batch” jobs will always get at least 26 (=65% of 40) cores
  • “Best_effort” jobs will always get at least 4 (=10% of 40) cores

The values are rounded down to the nearest whole number of cores. Resource pools are guaranteed a minimum of one core, unless configured to 0% allocation.
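As a minimal sketch, the per-pool guarantee can be computed as follows, assuming whole-number percentages. The function name and the dictionary layout are hypothetical; only the rules (round down, at least one core unless the pool is set to 0%) come from the text above.

```python
from typing import Dict


def pool_guaranteed_cores(plan_cores: int, pool_percentages: Dict[str, int]) -> Dict[str, int]:
    """Return the guaranteed core count for each pool.

    plan_cores       -- guaranteed cores for the whole plan (for example, 40)
    pool_percentages -- pool name mapped to its whole-number percentage
    """
    allocation = {}
    for name, percent in pool_percentages.items():
        if percent == 0:
            # A pool configured at 0% has no guaranteed cores.
            allocation[name] = 0
        else:
            # Round down, but never below one core.
            allocation[name] = max(1, plan_cores * percent // 100)
    return allocation


pools = {"adhoc": 25, "batch": 65, "Best_effort": 10}
print(pool_guaranteed_cores(40, pools))
# {'adhoc': 10, 'batch': 26, 'Best_effort': 4}
```

With a very small plan, the one-core floor matters: for 3 plan cores, a 25% pool would round down to 0 and is therefore raised to 1.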

During off-peak periods, jobs might be granted up to the full off-peak capacity of the whole plan. For example, any job could be granted up to 160 (=40*4) cores if the overall Treasure Data Hadoop workload permits.

If a job from one resource pool is running with more than its guaranteed resources and a job is submitted to another pool for which there are not currently enough resources, some portion of the processing for the first job may be pre-empted or deferred. This pre-emption does not mean that the job has to restart, only that it will take somewhat longer to run to completion. Jobs are always granted their minimum guaranteed resources.

Default Resource Pool Configuration

The default configuration for a Hadoop plan is a single resource pool, named either hadoop2 or hdp2, that is configured with 100% of the plan's resources.

To determine your current configured resource pools, contact Treasure Data support.

Selecting Which Resource Pool Your Job Will Run On

If resource pools are configured and no resource pool is specified for a job, the job runs in the default resource pool. By default, that pool is named either hadoop2 or hdp2 and is configured with 100% of resources.

If you want a different resource pool to serve as the default for jobs that do not specify one, contact Treasure Data support, who can configure it for you.

How you run a job in a specific resource pool depends on the kind of job and on how you submit it.

Hive Queries

For Hive queries, you can add a query hint (magic comment) that names the resource pool. For example:

-- @TD pool_name: batch
select count(1) from mytable;
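If you generate queries programmatically, the hint can be prepended with a small helper. The sketch below is hypothetical and simply reproduces the comment format from the example above; placing the hint on its own line before the query is an assumption based on that example.

```python
def with_pool_hint(sql: str, pool_name: str) -> str:
    """Prepend a '-- @TD pool_name: <name>' hint line to a Hive query."""
    if not pool_name or any(ch.isspace() for ch in pool_name):
        # Assumption: pool names are single tokens without whitespace.
        raise ValueError("pool_name must be a non-empty name without whitespace")
    return f"-- @TD pool_name: {pool_name}\n{sql}"


print(with_pool_hint("select count(1) from mytable;", "batch"))
```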

Using the CLI

To specify a resource pool for a Hive query, table export, or bulk import job at the command line, add the --pool-name argument to the td command.

A Hive query can be run as follows:

td query --type hive --database <database_name> --pool-name <resource_pool_name> "select count(1) from mytable"

If you already have a resource pool name specified by a Hive query hint, providing a conflicting --pool-name argument on the command line causes the job to fail.
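A script that submits queries could check for this conflict before calling td. The following is a rough client-side sketch, not Treasure Data's own validation: the regular expression assumes the hint format shown in the Hive example, the function names are hypothetical, and the code only builds the argument list (using the options shown in the command above) without running anything.

```python
import re
from typing import List, Optional

# Assumed hint format, based on the "-- @TD pool_name: batch" example.
HINT_PATTERN = re.compile(r"^\s*--\s*@TD\s+pool_name:\s*(\S+)\s*$", re.MULTILINE)


def hinted_pool(sql: str) -> Optional[str]:
    """Return the pool named in a query hint, or None if there is no hint."""
    match = HINT_PATTERN.search(sql)
    return match.group(1) if match else None


def build_td_query_args(database: str, sql: str, pool_name: Optional[str] = None) -> List[str]:
    """Build the argument list for 'td query', rejecting a conflicting pool."""
    hint = hinted_pool(sql)
    if pool_name is not None and hint is not None and hint != pool_name:
        raise ValueError(
            f"query hint names pool {hint!r} but --pool-name is {pool_name!r}"
        )
    args = ["td", "query", "--type", "hive", "--database", database]
    if pool_name is not None:
        args += ["--pool-name", pool_name]
    args.append(sql)
    return args


print(build_td_query_args("example_db", "select count(1) from mytable", "batch"))
```

The resulting list can be passed to a process launcher such as subprocess.run; catching the conflict locally avoids submitting a job that is expected to fail.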

A table export can be run as follows:

td table:export example_db table1 --s3-bucket mybucket -k KEY_ID -s SECRET_KEY --pool-name batch

A bulk import can be run as follows:

td import:perform logs_201201 --pool-name batch

Frequently Asked Questions

What if I submit a job with a resource pool name that is not configured?

The job runs in the default resource pool, using the resources configured for that pool. Unless you have asked support to change it, the default pool is configured with 100% of resources.
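The fallback behavior can be pictured as a simple lookup. This sketch is purely illustrative: the function name is hypothetical, and the pool names are taken from the examples in this article rather than from your account's actual configuration.

```python
from typing import Iterable, Optional


def effective_pool(requested: Optional[str], configured_pools: Iterable[str], default_pool: str) -> str:
    """Return the pool a job would run in under the fallback rule above."""
    if requested is not None and requested in set(configured_pools):
        return requested
    # No pool requested, or the requested name is not configured.
    return default_pool


configured = ["adhoc", "batch", "Best_effort", "hadoop2"]
print(effective_pool("batch", configured, "hadoop2"))     # batch
print(effective_pool("reports", configured, "hadoop2"))   # hadoop2
print(effective_pool(None, configured, "hadoop2"))        # hadoop2
```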

Further Support

Contact support@treasure-data.com if you have questions about this feature.


Last modified: Feb 27 2018 20:07:30 UTC
