You might encounter queries that take a very long time to finish. For example, suppose you query the IP addresses of your website visitors which you want to convert into countries and then make a daily summary of all visitors by each country. The query examines the data of IP addresses from the first visitor to the most recent. This query is processed over and over again for each query, resulting in a query that takes over an hour to complete.

Stream Data Processing

Stream data processing uses intermediate tables where only the new data (data that arrived since the last query) is processed and appended to the rest of the processed data, rather than processing the entire data every time.

How it Works

  1. Create an intermediate table by processing all the data you have up to a certain point in time.

  2. Schedule a query that processes new data that arrived since the last query and appends the data to the intermediate table.

  3. Query the intermediate table for the aggregated data.

1. Create Intermediate Table

The following example uses these elements to create the table:

  • Sample database: sample_db

  • Intermediate table: visitor_country

  • Raw data table: visitor_raw

CREATE TABLE visitor_country AS SELECT
  TD_DATE_TRUNK(‘day’, time, ‘PST’) AS time,
  userid,
  TD_IP_TO_COUNTRY_CODE(ip) country
FROM visitor_raw
WHERE TD_TIME_RANGE(time, null, ‘2017-07-01’, ‘PST’)

2. Schedule and Append to Intermediate Table

INSERT INTO visitor_country
SELECT
  TD_DATE_TRUNK(‘day’, time, ‘PST’) AS time,
  userid,
  TD_IP_TO_COUNTRY_CODE(ip) country
FROM visitor_raw
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), ‘-1d’), TD_SCHEDULED_TIME(), ‘PST’);

3. Aggregate the intermediate table

SELECT country, approx_distinct(userid) num_users
FROM vistitor_country
GROUP BY 1
  • No labels