Learn more about OneDrive Export Integration.

You can connect OneDrive to import OneDrive Personal or Business files, or SharePoint documents, into Treasure Data.


Prerequisites

  • Basic knowledge of Treasure Data, including the TD Toolbelt

  • A OneDrive or SharePoint account
    For OAuth authentication, you must grant approval yourself, because this integration does not support an authentication flow in which the OneDrive user obtains approval from an administrator.

  • Authorized Treasure Data account access

Incremental Loading for OneDrive

Incremental loading uses the last imported date of files to load records monotonically, importing only files inserted or updated after the most recent execution.

At the first execution, this connector loads all files matching the Filename Regex and Modified After settings. If incremental: true is set, the latest modified DateTime is saved as the new Modified After value.

Example:

  • Import folder contains files:

    +--------------+--------------------------+
    |   Filename   |       Last update        |
    +--------------+--------------------------+
    | File0001.csv | 2019-05-04T10:00:00.123Z |
    | File0011.csv | 2019-05-05T10:00:00.123Z |
    | File0012.csv | 2019-05-06T10:00:00.123Z |
    | File0013.csv | 2019-05-07T10:00:00.123Z |
    | File0014.csv | 2019-05-08T10:00:00.123Z |
    +--------------+--------------------------+

  • Filename Regex: File001.*.csv

  • Modified After: 2019-05-01T10:00:00.00Z

Then the files File0011.csv, File0012.csv, File0013.csv, and File0014.csv are imported: they match the Filename Regex, and they all have a last update > 2019-05-01T10:00:00.00Z.

After the job finishes, the new Modified After value 2019-05-08T10:00:00.123Z is saved.

At the next execution, only files having the last update > 2019-05-08T10:00:00.123Z are imported.

Example:

  • Import folder has newly updated and added files:

    +--------------+--------------------------+
    |   Filename   |       Last update        |
    +--------------+--------------------------+
    | File0001.csv | 2019-05-04T10:00:00.123Z |
    | File0011.csv | 2019-05-05T10:00:00.123Z |
    | File0012.csv | 2019-05-06T10:00:00.123Z |
    | File0013.csv | 2019-05-09T10:00:00.123Z |
    | File0014.csv | 2019-05-08T10:00:00.123Z |
    | File0015.csv | 2019-05-09T10:00:00.123Z |
    +--------------+--------------------------+

  • Filename Regex: File001.*.csv

  • Modified After: 2019-05-08T10:00:00.123Z

Then only the files File0013.csv and File0015.csv are imported.
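The selection logic in these examples can be sketched in Python. This is an illustration only: the connector evaluates Java regular expressions server-side, and the select_files helper below is hypothetical, not part of the connector.

```python
import re

def select_files(files, name_regex, modified_after, incremental=True):
    """Pick files whose names fully match the regex and whose modification
    time is strictly after the cutoff; also compute the next cutoff."""
    pattern = re.compile(name_regex)
    selected = [(name, mtime) for name, mtime in files
                if pattern.fullmatch(name) and mtime > modified_after]
    # If incremental, the latest modification time seen becomes the new
    # Modified After value; otherwise the cutoff stays unchanged.
    new_cutoff = (max((m for _, m in selected), default=modified_after)
                  if incremental else modified_after)
    return [name for name, _ in selected], new_cutoff

# The second example above; ISO-8601 UTC strings compare correctly as strings.
files = [
    ("File0001.csv", "2019-05-04T10:00:00.123Z"),
    ("File0011.csv", "2019-05-05T10:00:00.123Z"),
    ("File0012.csv", "2019-05-06T10:00:00.123Z"),
    ("File0013.csv", "2019-05-09T10:00:00.123Z"),
    ("File0014.csv", "2019-05-08T10:00:00.123Z"),
    ("File0015.csv", "2019-05-09T10:00:00.123Z"),
]
names, cutoff = select_files(files, r"File001.*.csv", "2019-05-08T10:00:00.123Z")
print(names)   # ['File0013.csv', 'File0015.csv']
print(cutoff)  # 2019-05-09T10:00:00.123Z
```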

Domain Name and Relative Path

Domain Name and URL Relative Path are required to access content in your OneDrive Business account plan, including your personal folder, SharePoint site collections, and folders shared with you.

Open the folder that you want to import in your browser, then note the Domain Name and Relative Path as shown:

Personal folder



SharePoint collection



Folder shared with me


Get Your Own Refresh_Token

To issue jobs from CLI, you need client_id, client_secret, and refresh_token.

Register a developer account, and go to the Azure App Registration Portal at https://portal.azure.com/#home. In the search box, search for App registrations.


Select New Registration.


Enter App Name, select Account Type and enter Redirect URI as shown. Select Register.


From the API permissions tab, select Add a permission.



From Request API permissions window, select Microsoft Graph -> Delegated permissions and add permissions as shown:


From the Certificates & secrets tab, select New client secret, select Never and Add.

A new client secret key is created as shown:

From the Overview tab, you see the client id as shown:


Replace the client_id in this URL and open it in a browser.

https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=xxxxxx&response_type=code&redirect_uri=http://localhost&response_mode=query&scope=offline_access files.readwrite.all openid sites.readwrite.all files.readwrite.appfolder&state=1234

Accept the Permission Request window.


Copy the code from the address bar:


Replace the code, client_id, and client_secret, then run a curl command to get the access_token and refresh_token:

curl -X POST \
  https://login.microsoftonline.com/common/oauth2/v2.0/token \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id={xxxxxx}&client_secret={xxxxx}&grant_type=authorization_code&scope=offline_access%20files.readwrite.all%20openid%20sites.readwrite.all&code={xxxxxx}&redirect_uri=http://localhost'
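The same exchange can also be prepared from Python's standard library. The sketch below only mirrors the curl command above (the endpoint, parameter names, and scopes are taken from that command; the placeholder values are yours to fill in):

```python
from urllib.parse import urlencode

TOKEN_URL = "https://login.microsoftonline.com/common/oauth2/v2.0/token"

def token_request_body(client_id, client_secret, code, redirect_uri="http://localhost"):
    """Build the x-www-form-urlencoded body for the authorization-code exchange."""
    return urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "authorization_code",
        "scope": "offline_access files.readwrite.all openid sites.readwrite.all",
        "code": code,
        "redirect_uri": redirect_uri,
    })

body = token_request_body("my-client-id", "my-client-secret", "my-auth-code")
print(body)
# POST the body to TOKEN_URL with Content-Type
# application/x-www-form-urlencoded (e.g. via urllib.request.urlopen),
# then read access_token and refresh_token from the JSON response.
```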

Filename Regex Matching

The connector supports Java regular expressions for filename matching; see the Java Pattern documentation.

For example:

  • Match any file with the .csv extension: .*.csv

  • Match any file starting with the prefix abc: abc.*

  • Match exactly the file abc.csv: abc.csv
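Java's String.matches() anchors the pattern to the whole filename, and Python's re.fullmatch() behaves the same way for these examples, so the patterns can be tried out with a short sketch (the matches helper is hypothetical, for illustration only):

```python
import re

def matches(filename, pattern):
    # Java's matches() requires the whole string to match, like re.fullmatch().
    return re.fullmatch(pattern, filename) is not None

assert matches("report.csv", r".*.csv")    # any file with a .csv extension
assert matches("abc_2019.txt", r"abc.*")   # any file starting with prefix abc
assert matches("abc.csv", r"abc.csv")      # exactly abc.csv...
assert matches("abcXcsv", r"abc.csv")      # ...but the unescaped '.' also matches
                                           # any character; use abc\.csv to match
                                           # the dot literally
assert not matches("abc.csv.bak", r".*.csv")
```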

Requested Permissions

The OneDrive connector asks you for the following permissions if you are using a personal account:

  • User.Read: Allows users to sign in to the app, and allows the app to read the profile of signed-in users. It also allows the app to read the basic company information of signed-in users.

  • Files.ReadWrite.AppFolder: Allows the app to read, create, update, and delete files in the application's folder. The application's folder is a dedicated, special folder for the "TreasureData OneDrive Connector" app; see What is an App Folder.

  • Files.ReadWrite.All: Allows the app to read, create, update, and delete all files the signed-in user can access. The import and export connectors share the same client application and only perform read, create, and update, but never delete files.

  • offline_access: Allows the app to read and update user data, even when the user is not currently using the app. Used for scheduled import/export.


A Business/work account requests one extra permission:

  • Sites.ReadWrite.All: Read and write all groups; edit or delete items in all site collections. The connector only performs read, create, and update, but never deletes files.

Note: Although the Delete permission is granted by default as part of the Sites.ReadWrite.All permission, the data connector does not delete files on Microsoft OneDrive.

Use the TD Console to Create Your Connection

Create a New Connection

When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.

Go to Integrations Hub -> Catalog and search and select OneDrive.


The following dialog opens.


Access to OneDrive from Treasure Data requires OAuth2 authentication. The authentication requires that users manually connect their Treasure Data account to their respective OneDrive account.

To authenticate, complete the following procedure:

  1. Select Click here to connect to a new account.

  2. Log into your OneDrive account in the popup window and grant access to the Treasure Data app.

  3. You will be redirected back to TD Console. Repeat the first step (Create a new connection) and choose your new OAuth connection.

  4. Name your new OneDrive connection. Select Done.


Transfer Your OneDrive Data to Treasure Data

In Authentications, configure the New Source.



Complete the details and select Next.

The OneDrive connector supports two account plan types: Business and Personal.


Parameters:

  • OneDrive Account Plan: OneDrive account plan, supports two types:

    • Business

    • Personal

  • Ingest Shared Files: Applicable for the Business account plan only. Import files directly shared with you (different from folders shared with you).

  • Domain Name: Applicable for the Business account plan only. See Appendix Domain Name and Relative Path

  • Relative Path: Applicable for the Business account plan only. Specify the URL relative path. See Appendix Domain Name and Relative Path

  • Folder Path: The folder path to import files from, e.g. /import/20190501

  • This is a shared folder: Applicable for the Personal account plan only. Specify the relative path to the shared folder. See FAQ How do I import a folder shared with me.

  • Filename RegEx: Only import files that match the provided filename regex. Read more about regular expressions. Also see Filename Regex Matching.

  • Modified After: Only import files modified after this timestamp. The timezone is UTC. If this parameter is empty, all files are imported. Timestamp format: yyyy-MM-dd'T'HH:mm:ss.SSSZ, e.g. 2019-05-01T10:00:00.121Z

  • Incremental? When running repeatedly, attempt to import files only since the last import. See How Incremental Loading works.
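For reference, the Modified After format in Java date notation (yyyy-MM-dd'T'HH:mm:ss.SSSZ) corresponds to the following strptime format in Python (illustrative only; %z accepts a literal "Z" on Python 3.7+):

```python
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f%z"  # yyyy-MM-dd'T'HH:mm:ss.SSSZ
ts = datetime.strptime("2019-05-01T10:00:00.121Z", fmt)
print(ts.isoformat())  # 2019-05-01T10:00:00.121000+00:00
```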

Preview

You can see a preview of your data before running the import by selecting Generate Preview.

Data shown in the data preview is approximated from your source. It is not the actual data that is imported.

  1. Select Next.
    Data preview is optional and you can safely skip to the next page of the dialog if you want.

  2. To preview your data, select Generate Preview. Optionally, select Next.

  3. Verify that the data looks approximately like you expect it to.


  4. Select Next.





Advanced Settings

Advanced Settings allow you to customize the guessed properties. Edit the following section, if needed.


  • Decoders: You can modify the guessed file decoder, depending on how the files stored on OneDrive are encoded. Supported file decoders:

    • Gzip

    • Bzip2

    • Zip

    • OpenPGP Decryption

  • Parser: Update the File Parser if needed.

    • Columns:

      • Name: Changes the name of the column. Supported characters for column names are lowercase letters, numbers, and “_” (underscore) only.

      • Type: Parses a value as a specified type and stores the type as part of the Treasure Data schema.

        • boolean

        • long

        • timestamp: imported as string type in Treasure Data (e.g. 2017-04-01 00:00:00.000)

        • double

        • string

        • json

  • Retry Limit: Optional. The number of retry attempts when an error occurs before the system gives up. Default: 7

  • Initial retry time wait in millis: Optional. The wait, in milliseconds, between the first and second attempts. Default: 500 milliseconds

  • Max retry wait in millis: Optional. The maximum wait, in milliseconds, between retry attempts. Default: 300000 (~5 minutes)

  • HTTP Connect Timeout: Optional. HTTP connection timeout, in milliseconds. Default: 60000 (~1 minute)

  • HTTP Read Timeout: Optional. HTTP read timeout, in milliseconds. Default: 300000 (~5 minutes)

  • Minimum task size: When the import folder contains many small files, import performance can degrade. Set this value to combine multiple files per task to improve performance. E.g. 4000000 (~4 MB)
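To see how the retry settings interact, here is a sketch of a backoff schedule under the assumption of doubling (exponential) backoff between the initial and maximum waits; the actual backoff curve is not documented here, so treat this as an approximation:

```python
def retry_waits(retry_limit=7, initial_wait_millis=500, max_wait_millis=300_000):
    """Assumed exponential backoff: each wait doubles, capped at the maximum."""
    waits, wait = [], initial_wait_millis
    for _ in range(retry_limit):
        waits.append(wait)
        wait = min(wait * 2, max_wait_millis)
    return waits

print(retry_waits())  # [500, 1000, 2000, 4000, 8000, 16000, 32000]
```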

Choose the Target Database and Table

Choose an existing or create a new database and table.


Create a new database and give your database a name. Complete similar steps for Create new table.

Select whether to append records to an existing table or replace your existing table.

If you want to set a different partition key seed rather than use the default key, you can specify one using the popup menu.

Scheduling

In the Schedule tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, select Start Transfer. If you select Repeat… specify your schedule options, then select Schedule Transfer.


Enter the New Source name and select DONE


After your transfer has run, you can see the results of your transfer in the Databases tab.


Use Command-Line to Create Your OneDrive Connection

You can also use the command line, rather than the TD Console, to configure your connection.

Install the Treasure Data Toolbelt

Open a terminal and run the following command to install the newest TD Toolbelt.

$ gem install td
$ td --version
0.16.1

Create a Configuration File (seed.yml)

Using a text editor, create a file called seed.yml. Copy and paste the following information, replacing the placeholder text with your OneDrive account details. Note that this configuration appends all file contents from the /xxx folder to the target table because “append” mode is specified. For more details on available out modes, see the Appendix.

If your account is a Business account plan

in:
  type: one_drive
  refresh_token: 'xxxxxxxxxxxxxxxx'
  client_id: 'xxxxxxxxxx'
  client_secret: 'xxxxxx'
  account_type: business
  domain_name: xxxx.xxxx.com
  server_relative_path: /xxx/xx
  folder_path: /xxx
out:
  mode: append

If your account is a Personal account plan

in:
  type: one_drive
  refresh_token: 'xxxxxxxxxxxxxxxx'
  client_id: 'xxxxxxxxxx'
  client_secret: 'xxxxxx'
  account_type: personal
  folder_path: /xxx
out:
  mode: append

Authentication

Specify the client_id, client_secret, and refresh_token for authenticating with OneDrive. For more information, see the Appendix Get Your Own Refresh_Token.

For a Business account, also specify the domain_name, server_relative_path, and folder_path. See the Appendix Domain Name and Relative Path.

Run the Guess Fields Command (Generate a load.yml File)

Use the guess command if you are pulling in data from a schema-less data source. Run the following command from your terminal:

$ td connector:guess seed.yml -o load.yml

connector:guess automatically reads the target data and intelligently guesses the data format.

Open the file load.yml. You’ll see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.

Configuration keys and descriptions are as follows:

  • account_type (string, required): OneDrive account plan: business | personal

  • client_id (string, required): OAuth client_id

  • client_secret (string, required): OAuth client_secret

  • refresh_token (string, required): See Get Your Own Refresh_Token.

  • is_shared_root (boolean, optional): Applicable for the business account_type only. When true, imports files directly shared with you (different from shared folders).

  • is_shared_folder (boolean, optional): Applicable for the personal account_type only. When true, imports files/folders shared with you.

  • domain_name (string, optional): Required when account_type = business.

  • server_relative_path (string, optional): Required when account_type = business.

  • folder_path (string, optional): Required when is_shared_root = false.

  • last_modified_time (string, optional): Only import files modified after the specified timestamp. The timezone is UTC. If this parameter is empty, all files are imported. Timestamp format: yyyy-MM-dd'T'HH:mm:ss.SSSZ, e.g. 2019-05-01T10:00:00.121Z

  • name_match_pattern (string, optional): Only import files that match the provided filename regex. See Filename Regex Matching.

  • incremental (boolean, optional): When running repeatedly, attempts to import only files changed since the last import.

  • retry_limit (integer, optional): The number of retry attempts when an error occurs before the system gives up. Default: 7

  • retry_initial_wait_millis (integer, optional): The wait, in milliseconds, between the first and second attempts. Default: 500

  • max_retry_wait_millis (integer, optional): The maximum wait, in milliseconds, between retry attempts. Default: 300000 (~5 minutes)

  • connection_timeout_millis (long, optional): HTTP connection timeout, in milliseconds. Default: 60000 (~1 minute)

  • read_timeout_millis (long, optional): HTTP read timeout, in milliseconds. Default: 300000 (~5 minutes)

  • min_task_size (long, optional): When the import folder contains many small files, import performance can degrade. Set this value to combine multiple files per task to improve performance. E.g. 4000000 (~4 MB)

An example of load.yml for the personal account plan, with decoders and parser columns:

in:
  type: one_drive
  refresh_token: 'xxxxxxxxxxxxxxxx'
  client_id: 'xxxxxxxxxx'
  client_secret: 'xxxxxx'
  account_type: personal
  folder_path: /xxx
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: ''
    skip_header_lines: 1
    columns:
    - name: id
      type: long
    - name: company
      type: string
    - name: customer
      type: string
    - name: created_at
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
out: {mode: append}
exec: {}

Optionally Preview the Data to be Imported

You can preview how the system will parse your data by using a preview command.

$ td connector:preview load.yml
+-------+---------+----------+---------------------+
| id    | company | customer | created_at          |
+-------+---------+----------+---------------------+
| 11200 | AA Inc. |    David | 2015-03-31 06:12:37 |
| 20313 | BB Imc. |      Tom | 2015-04-01 01:00:07 |
| 32132 | CC Inc. | Fernando | 2015-04-01 10:33:41 |
| 40133 | DD Inc. |    Cesar | 2015-04-02 05:12:32 |
| 93133 | EE Inc. |     Jake | 2015-04-02 14:11:13 |
+-------+---------+----------+---------------------+

The guess command needs at least 3 rows and 2 columns in the source data file, because it guesses the column definitions using sample rows from the source data.

If the system detects your column names or column types incorrectly, modify load.yml directly and run the preview again.

The Data Connector supports parsing of "boolean", "long", "double", "string", and "timestamp" types.

The preview command downloads one file from the specified folder and displays the results from that file. This may cause a difference in results between the preview and issue commands.

Execute the Load Job

You use td connector:issue to execute the job. The following are required:

  • the database and table where the data will be stored

  • the Data Connector configuration file

$ td connector:issue load.yml --database td_sample_db --table td_sample_table

If the database or the table does not exist in TD, td connector:issue will fail.

It’s also recommended to specify the --time-column option, because Treasure Data’s storage is partitioned by time (see data partitioning). If the --time-column option is not given, the data connector selects the first long or timestamp column as the partitioning time. The column specified by --time-column must be of long or timestamp type (use the preview results to check the available column names and types); the time column appears at the end of the preview output.

$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
  --time-column created_at

If your data doesn’t have a time column, you can add one by adding the add_time filter to your configuration file as follows.

in:
  type: one_drive
  ...
filters:
- type: add_time
  from_value:
    mode: upload_time
  to_column:
    name: time
out:
  type: td

More details are available in the add_time filter plugin documentation.

The integration does not sort records server-side. To use time-based partitioning effectively, sort records in the files before running the import job.
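As a pre-processing sketch, a CSV file could be sorted by its time column before uploading it to OneDrive. The sort_csv_by_time helper and the created_at column name are illustrative (the column name is taken from the sample data in this article), not part of the connector:

```python
import csv
import io

def sort_csv_by_time(text, time_column="created_at"):
    """Sort CSV rows by a time column; 'YYYY-mm-dd HH:MM:SS' timestamps
    sort correctly as plain strings."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda row: row[time_column])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

sample = "id,created_at\n2,2015-04-01 01:00:07\n1,2015-03-31 06:12:37\n"
print(sort_csv_by_time(sample))
```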

If you have a field called time, you don’t need to specify the --time-column option.

$ td connector:issue load.yml --database td_sample_db --table td_sample_table

Scheduled Execution

You can schedule periodic data connector execution for recurring OneDrive imports. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon in your local data center.

For a scheduled import, the data connector for OneDrive imports all objects that match the name_match_pattern parameter. When incremental is set, the last_modified_time parameter is updated after each run, so only files updated or newly added to folder_path since the last import are loaded. See How Incremental Loading works.

Here’s an example of a seed file using incremental mode combined with ‘append’ mode for output.

in:
  type: one_drive
  refresh_token: 'xxxxxxxxxxxxxxxx'
  client_id: 'xxxxxxxxxx'
  client_secret: 'xxxxxx'
  account_type: personal
  folder_path: /xxx
  incremental: true
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: ''
    skip_header_lines: 1
    columns:
    - name: id
      type: long
    - name: company
      type: string
    - name: customer
      type: string
    - name: created_at
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
out: {mode: append}
exec: {}

Create the Schedule

A new schedule can be created using the td connector:create command. The following are required:

  • name of the schedule

  • cron-style schedule

  • database and table where their data will be stored

  • Data Connector configuration file

$ td connector:create \
    daily_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml

Remember to specify the --time-column option, because Treasure Data’s storage is usually partitioned by time.

$ td connector:create \
    daily_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --time-column created_at

The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.

By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.

List the Schedules

You can see the list of scheduled entries by running the command td connector:list.

$ td connector:list
+--------------+------------+----------+-------+--------------+-----------------+----------------------------+
| Name         | Cron       | Timezone | Delay | Database     | Table           | Config                     |
+--------------+------------+----------+-------+--------------+-----------------+----------------------------+
| daily_import | 10 0 * * * | UTC      | 0     | td_sample_db | td_sample_table | {"in"=>{"type"=>"onedrive",|
+--------------+------------+----------+-------+--------------+-----------------+----------------------------+

Show the Settings and Schedule History

td connector:show shows the execution settings of a schedule entry.

% td connector:show daily_import
Name     : daily_import
Cron     : 10 0 * * *
Timezone : UTC
Delay    : 0
Database : td_sample_db
Table    : td_sample_table
Config
---
in:
  type: one_drive
  refresh_token: 'xxxxxxxxxxxxxxxx'
  client_id: 'xxxxxxxxxx'
  client_secret: 'xxxxxx'
  account_type: personal
  folder_path: /xxx
  incremental: true
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    ...

td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.

% td connector:history daily_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID  | Status  | Records | Database     | Table           | Priority | Started                   | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-18 00:10:05 +0000 | 160      |
| 577968 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-17 00:10:07 +0000 | 161      |
| 577914 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-16 00:10:03 +0000 | 152      |
| 577872 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-15 00:10:04 +0000 | 163      |
| 577810 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-14 00:10:04 +0000 | 164      |
| 577766 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-13 00:10:04 +0000 | 155      |
| 577710 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-12 00:10:05 +0000 | 156      |
| 577610 | success | 10000   | td_sample_db | td_sample_table | 0        | 2019-05-11 00:10:04 +0000 | 157      |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set

Delete the Schedule

td connector:delete removes the schedule.

$ td connector:delete daily_import

FAQ for Import from OneDrive

Q: How do I import a folder that someone shared with me?

Business Account Plan:

You need the domain name, server relative path, and folder path. Go to the Shared With Me tab in your browser:


Select the Shared folder and make note of the domain name, relative path, and folder path as described in Domain Name and Relative Path.

Personal Account Plan

You must set the field: This is a shared folder (is_shared_folder) to true, and drill down to the shared folder as shown in the image.


Select the shared folder f1 and navigate to the f2 target folder to import.


Input the folder path param as shown: folder_path: /f1/f2 (note: folder f0 is not shared with you)


Q: How can I import files from SharePoint collections?

You need a Domain Name, server Relative Path, and Folder Path to import files from SharePoint collections. See Domain Name and Relative Path

Q: Why did I get the error "Attribute type is required but not set"?

Your import folder contains empty files, or your files do not have a line break character. Delete the empty files, and make sure import files have at least 2 lines of data.

Q: We need to obtain approval from the OneDrive admin user to access the app. After obtaining approval from the admin, the OAuth connection doesn't show up in the connection list.

This is because the connector doesn't support this OAuth flow (requesting approval from an administrator). Ask your OneDrive admin user to create the authentication on your behalf.

Appendix

Modes for the Out Plugin

You can specify file import mode in the out section of the load.yml file.

The out section controls how data is imported into a Treasure Data table.
For example, you may choose to append data or replace data in an existing table in Treasure Data.

Output modes are ways to modify the data as the data is placed in Treasure Data.

  • Append (default): Records are appended to the target table.

  • Replace (available in td 0.11.10 and later): Replaces data in the target table. Any manual schema changes made to the target table remain intact.

Examples:

in:
  ...
out:
  mode: append


in:
  ...
out:
  mode: replace


