CSV Parser Function
Copy for LLM
Copy page as Markdown for LLMs
View as Markdown
Open this page as Markdown
Open in ChatGPT
Get insights from ChatGPT
Open in Claude
Get insights from Claude
Connect to Cursor
Install MCP server on Cursor
Connect to VS Code
Install MCP server on VS Code

The CSV parser plugin for Treasure Data's integrations parses CSV data. It has 15 options as explained in the following table. For more information, see CSV parser plugin.

Option	Description
delimiter	Delimiter character. This can be any single-byte character. For example, for CSV you could us \t, or for TSV you could use
use_string_literal	Checked when you want to use special characters such as \t in your configuration
quote	The character surrounding a quoted value. Setting \0 disables quoting.
escape	Escape character to escape a special character. Setting \0 disables escaping.
skip_header_lines	Skip this number of lines first. Set 1 if the file has a header line.
null_string	If a value is in this string, converts it to NULL. For example, set \N for CSV files created by mysqldump
trim_if_not_quoted	If true, remove spaces of a value if the value is not surrounded by the quote character
comment_line_marker	Skip a line if the line begins with this string
allow_optional_columns	If true, set null to insufficient columns. Otherwise, skip the row in case of insufficient number of columns. Because the CSV parser can't know exactly missing columns when parse row, any missing columns will be considered these last columns.
allow_extra_columns	If true, ignore too many columns. Otherwise, skip the row in case of too many columns.
max_quoted_size_limit	Maximum number of bytes of a quoted value. If a value exceeds the limit, the row will be skipped
stop_on_invalid_record	Stop bulk load transaction if a file includes invalid record (such as invalid timestamp)
default_timezone	Time zone of timestamp columns if the value itself doesn’t include time zone description (eg. Asia/Tokyo)
newline	Newline character (CRLF, LF or CR)
line_delimiter_recognized	If true, the specified newline character is processed as a new line, and the character is also preserved as data.
charset	Character encoding (eg. ISO-8859-1, UTF-8). See Supported Values for charset Option in CSV Parser.
quotes_in_quoted_fields	Specify how to deal with irregular unescaped quote characters in quoted fields. Specify either ACCEPT_ONLY_RFC4180_ESCAPED or ACCEPT_STRAY_QUOTES_ASSUMING_NO_DELIMITERS_IN_FIELDS.
columns	Columns (see below)

The columns option declares the list of columns. This CSV parser plugin ignores the header line. The order of columns will be used to map csv columns (by their orders), and discard the header (if exists).

Column	Description
name	Name of the column
type	Type of the column (see below). boolean : true or false long : 64-bit signed integers timestamp : Date and time with nano-seconds precision double : 64-bit floating point numbers string : Strings
format	Format of the timestamp if type is timestamp<

Quotes in Quoted Fields Options

The quotes_in_quotes_fields option specifies how to deal with irregular non-escaped stray quote characters.

Name	Description
ACCEPT_ONLY_RFC4180_ESCAPED	Default. CSV parsing is RFC 4180 compliant.
ACCEPT_STRAY_QUOTES_ASSUMING_NO_DELIMITERS_IN_FIELDS	Accept stray quotes as-is in between quotation marks. If there are delimiters inside the quotation marks, the behavior is undefined. For example, if the field were "a"b", the middle quotation mark would be treated as just another character, resulting in a"b. However, if the field contained "a""b", the middle two quotation marks would be considered an escaped quotation mark, resulting in a"b.

Name

Description

ACCEPT_ONLY_RFC4180_ESCAPED

Default. CSV parsing is RFC 4180 compliant.

ACCEPT_STRAY_QUOTES_ASSUMING_NO_DELIMITERS_IN_FIELDS

Accept stray quotes as-is in between quotation marks. If there are delimiters inside the quotation marks, the behavior is undefined.

For example, if the field were "a"b", the middle quotation mark would be treated as just another character, resulting in a"b. However, if the field contained "a""b", the middle two quotation marks would be considered an escaped quotation mark, resulting in a"b.

CSV Parser Configuration

You can use guess to automatically generate the column settings. Or set the parser section like an example:

in:
...
  parser:
    type: csv
    charset: UTF-8
    newline: CRLF
    delimiter: "\t"
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    skip_header_lines: 1
    comment_line_marker: '#'
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
...

CSV Parser Use Case Examples

Here are some example use cases for the CSV parser function:

Simple CSV
[Using null_string,trim_if_not_quoted,comment_line_marker and max_quoted_size_limit](<csv-parser-function.md#csv-parser-use-case-example-2)
Effect of FALSE and TRUE with allow_optional_columns
Effect of FALSE and TRUE with allow_extra_columns
Using \0 as the quote and escape character

CSV Parser Use Case Example 1

Here is an example of using simple CSV with the parser function.

Example of Source data

id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50,"white","2016-01-06"
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple door",18.50,"purple","2016-01-09"

in:
...
parser:
  charset: UTF-8
  newline: CRLF
  type: csv
  delimiter: ","
  quote: "\""
  escape: "\""
  trim_if_not_quoted: false
  skip_header_lines: 1
  allow_extra_columns: false
  allow_optional_columns: false
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

You can show parsed CSV data by the preview command.

$ td connector:preview load.yml

If your data doesn’t have a time column you may add it using add_time filter option. More details at add_time filter function.

$ td connector:issue load.yml \
--database <database name> \
--table <table name>  \
--time-column timestamp --auto-create-table

CSV Parser Use Case Example 2

Here are examples for using

null_string
trim_if_not_quoted
comment_line_marker
max_quoted_size_limit

Source data is.

id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50,"   white   ","2016-01-06"
6,"A black door",16.50,    black    ,"2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple doooooooooooooooooooooooor",18.50,"purple","2016-01-09"

If you use following load.yml for it.

in:
...
parser:
  charset: UTF-8
  newline: CRLF
  type: csv
  delimiter: ","
  quote: "\""
  escape: "\""
  skip_header_lines: 1
  null_string "red"
  trim_if_not_quoted: true
  comment_line_marker: "2"
  max_quoted_size_limit: 20
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

The preview result is.

$ td connector:preview load.yml

+---------+-----------------+--------------+---------------+---------------------------+
| id:long | name:string     | price:double | tag:string    | timestamp:timestamp       |
+---------+-----------------+--------------+---------------+---------------------------+
| 1       | "A green door"  | 11.5         | "green"       | "2016-01-02 00:00:00 UTC" |
| 3       | "A red door"    | 13.5         | nil           | "2016-01-04 00:00:00 UTC" |
| 4       | "A pink door"   | 14.5         | "pink"        | "2016-01-05 00:00:00 UTC" |
| 5       | "A white door"  | 15.5         | "   white   " | "2016-01-06 00:00:00 UTC" |
| 6       | "A black door"  | 16.5         | "black"       | "2016-01-07 00:00:00 UTC" |
| 7       | "A yellow door" | 17.5         | "yellow"      | "2016-01-08 00:00:00 UTC" |
+---------+-----------------+--------------+---------------+---------------------------+

null_string replaced the “red” tag to nil. (“A red door” is not target, because of not full match.)
trim_if_not_quoted trimmed “black” tag’s Beginning and Ending spaces because it’s not quoted. (“white” tag is not target, because it’s quoted.)
comment_line_marker skipped “blue” record because it raw beginning with “2”.
max_quoted_size_limit skipped “purple” record because it name exceeded value.

CSV Parser Use Case Example 3

Here is an example of how setting allow_optional_columns to FALSE (default) or TRUE affects the parsing results.

Source data is.

id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaa

load.yml setting as below

in:
...
parser:
  ...
  allow_optional_columns: [ false / true ]
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
  - {name: additional, type: string}
filters: []
out: {mode: append}
exec: {}

Result of false doesn’t have “red” and “black” rows because they don’t have “additional” column values.

$ td connector:preview load.yml

+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" | "aaa"             |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" | "aaa"             |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" | "aaa"             |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" | "aaa"             |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" | "aaa"             |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" | "aaa"             |
+---------+-----------------+--------------+------------+---------------------------+-------------------+

On the other hand, result of true is filled null.

+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" | "aaa"             |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" | "aaa"             |
| 3       | "A red door"    | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" | nil               |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" | "aaa"             |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" | "aaa"             |
| 6       | "A black door"  | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" | nil               |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" | "aaa"             |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" | "aaa"             |
+---------+-----------------+--------------+------------+---------------------------+-------------------+

CSV Parser Use Case Example 4

Here is an example of how setting allow_extra_columns to FALSE (default) or TRUE affects the parsing results.

The source data is:

id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaa

The load.yml setting is as below:

in:
...
parser:
  ...
  allow_extra_columns: [ false / true ]
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

Result of false shows only “red” and “black” rows because “additional” column isn’t defined in columns and other rows have undefined column values.

$ td connector:preview load.yml

+---------+----------------+--------------+------------+---------------------------+
| id:long | name:string    | price:double | tag:string | timestamp:timestamp       |
+---------+----------------+--------------+------------+---------------------------+
| 3       | "A red door"   | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" |
| 6       | "A black door" | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" |
+---------+----------------+--------------+------------+---------------------------+

On the other hand, result of true shows all rows without “additional” columns.

$ td connector:preview load.yml

+---------+-----------------+--------------+------------+---------------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       |
+---------+-----------------+--------------+------------+---------------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" |
| 3       | "A red door"    | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" |
| 6       | "A black door"  | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" |
+---------+-----------------+--------------+------------+---------------------------+

CSV Parser Use Case Example 5

Here is an example of specifying \0 as the quote and escape character.