Presto Delete Statement

Table of Contents

Why is Delete query statements important?

DELETE statements are helpful in a wide variety of situations, such as:

  • Cleaning up error records
  • Removing duplicate records
  • Removing unnecessary or unexpected results from result tables

Previously, Treasure Data only supported delete based on the td_partial_delete CLI command. This feature drastically improves the flexibility with which our users are able to delete & manage data in their tables.

How do I issue a DELETE query statement?

General Description

Presto DELETE query statements can be issued using the the following syntax

DELETE FROM table_name [ WHERE condition ]
Untitled-3
As a protection mechanism DELETE statements do require you to include a WHERE clause. This is to reduce the occurrence of unintentional DELETE commands.

Examples

Delete all line items shipped by air:

DELETE FROM lineitem WHERE shipmode = 'AIR'

Delete all event logs from a July 1st, 2017 UTC:

DELETE FROM lineitem WHERE TD_TIME_RANGE(time, '2017-07-01','2017-07-02')

Resource Consumption During Delete Statement

DELETE queries issue jobs & consume resources just like regular presto queries.

Untitled-3
Due to the nature of Treasure Data's environment, some of these queries can be quite resource intensive if not restricted by the partitioning of your table.

For example, if your table is only partitioned by time (partitioning based on columns other than ‘time’ is coming soon), and you are trying to delete a set of users that were created in June 2017, it is important to include the time restriction even if that’s not a critical parameter for your DELETE statement.

For example, you should:

DELETE FROM logtable WHERE userid IN (1234, 1235, 1236) AND TD_TIME_RANGE(time, '2017-06-01','2017-07-01')

Including the time restriction will significantly improve the speed of your query, and reduce the resources this query consumes.

Known Limitations

No DELETE for Recently Streaming Collected Data

One restriction on this functionality is that data that is recently collected (typically last 2-4 hours) via streaming ingestion method Can Not be deleted using DELETE statement. This is because streaming data is temporarily stored in our “real-time” data storage.

If you try to delete data in real-time storage, the query will fail to run. In these cases, users have 2 options: 1. Users can select a range of data that does not include data within real-time storage. 2. The data in real-time storage will be propagated to archive storage, and can then be deleted. This typically occurs within 2-6 hours after arrival, though it can be longer at times. 3. Users can add the following “magic comment” within the SQL statement, in order to force the query to run. In this case the query WILL STILL NOT DELETE data in real-time storage, but we will not cause the query to fail.

-- set session force_delete = 'true'

DELETE against DataTank

DELETE statements can not be issued against data stored in Data Tank. Please see our documentation regarding querying Data Tank from Presto for more details.

Multiple DELETE jobs on the same table are NOT allowed when there is overwrap on a partition file

Multiple DELETE jobs on the same table are allowed only when there’s no overwrap on a partition file. If the table is partitioned by 1-hour, it can be deleted with different time-range. But there could be large time range data files so it’s not always guaranteed to be executed. More precisely, multiple jobs cannot remove the same partition file at the same time. Also when a DELETE job is running and internal optimization process for partitioning merges (deletes small files and create large one) the data which DELETE involves, Metadata transaction conflicted with others error happens.

In this case, please retry the DELETE job manually.


Last modified: Dec 08 2017 07:12:43 UTC

If this article is incorrect or outdated, or omits critical information, let us know. For all other issues, access our support channels.