You can connect OneDrive to import Personal, Business files, or SharePoint documents into Treasure Data.
Basic knowledge of Treasure Data, including the TD Toolbelt
A OneDrive or SharePoint account
For the OAuth authentication, you need to give approval to yourself, because this integration does not support an authentication flow where the OneDrive user can obtain approval from an administrator.
Authorized Treasure Data account access
Incremental Loading for OneDrive
Incremental loading uses the last imported date of files to load records monotonically, inserting or updating files after the most recent execution.
At the first execution, this connector loads all files matching the Filename Regex and Modified After. If incremental
: true is set, the latest modified DateTime will be saved as a new Modified After value.
Import folder contains files:
Filename Regex: File001.*.csv
Modified After: 2019-05-01T10:00:00.00Z
Then the files: File0011.csv, File0012.csv, File0013.csv, and File0014.csv are imported as they match the Filename Regex, and all having the last update > 2019-05-01T10:00:00.00Z.
After the job finished, new Modified After = 2019-05-08T10:00:00.123Z is saved.
At the next execution, only files having the last update > 2019-05-08T10:00:00.123Z are imported.
Import folder has newly updated and added files:
Filename Regex: File001.*.csv
Modified After: 2019-05-08T10:00:00.123Z
Then only files: File0013.csv and File0015.csv are imported.
Domain Name and Relative Path
Domain Name and URL Relative Path are required to access your OneDrive business account plan, including your personal folder, SharePoint site collections, and folders shared with you.
Open the folder that you want to import on the browser. Then get the Domain Name and Relative Path as shown:
Folder shared with me
Get Your Own Refresh_Token
To issue jobs from CLI, you need client_id, client_secret, and refresh_token.
Register a developer account, and go to Azure App Registration Portal https://portal.azure.com/#home. On the search box, search for App Registration
Select New Registration.
Enter App Name, select Account Type and enter Redirect URI as shown. Select Register.
From the App Permissions tab, select Add a Permission.
From Request API permissions window, select Microsoft Graph -> Delegated permissions and add permissions as shown:
From the Certificates & secrets tab, select New client secret, select Never and Add.
A new client secret key is created as shown:
From the Overview tab, you see the client id as shown:
Replace the client_id in this URL and open it on a browser.
Accept the Permission Request window.
Copy the code from the address bar:
Replace the code, client_id, client_secret and run a curl command to get the access_token and refresh_token
Filename Regex Matching
The connector supports the Java language Regex for Filename matcher see the document.
Match any files with .csv extension: .*.csv
Match any file start with prefix abc: abc.*
Match exactly file: abc.csv: abc.csv
OneDrive Connector will ask you for following permissions if you are using personal account:
Allows users to sign-in to the app, and allows the app to read the profile of signed-in users. It also allows the app to read the basic company information of signed-in users.
Allows the app to read, create, update, and delete files in the application's folder.
Application's folder is a dedicated, special folder for "TreasureData OneDrive Connector" app see What is an App Folder
Allows the app to read, create, update, and delete all files the signed-in user can access.
The import and export Connector shares the same client application and only performs read, create and update but not delete files.
Allows the app to read and update user data, even when they are not currently using the app.
Use for schedule import/export
Business/work account requests extra permission
Read and write all groups, Edit or delete items in all site collections
only performs read, create and update but not delete files.
Note: Although the Delete permission is granted by default as part of the Sites. ReadWrite. All permission, the data connector does not delete files on Microsoft OneDrive.
Use the TD Console to Create Your Connection
Create a New Connection
When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.
Go to Integrations Hub -> Catalog and search and select OneDrive.
The following dialog opens.
Access to Treasure Data OneDrive requires OAuth2 authentication. The authentication requires that users manually connect their Treasure Data account to their respective OneDrive account.
To authenticate, complete the following procedure:
Select Click here to connect to a new account.
Log into your OneDrive account in the popup window and grant access to the Treasure Data app.
You will be redirected back to TD Console. Repeat the first step (Create a new connection) and choose your new OAuth connection.
Name your new OneDrive Connection. Select Done.
Transfer Your OneDrive Data to Treasure Data
In Authentications, configure the New Source.
Complete the details and select Next.
OneDrive Connector supports two Account Plan types:
OneDrive Account Plan: OneDrive account plan, supports two types:
Ingest Shared Files: Applicable for the Business account plan only. Import files directly shared with you (different from folders shared with you).
Domain Name: Applicable for the Business account plan only. See Appendix Domain Name and Relative Path
Relative Path: Applicable for the Business account plan only. Specify the URL relative path. See Appendix Domain Name and Relative Path
Folder Path: Folder Path to import files E.g. /import/20190501
This is a shared folder: Applicable for the Personal account plan only. Specify the relative path to the shared folder. See FAQ How do I import a folder shared with me.
Modified After: Only import files modified after this timestamp. Timezone is UTC. If this parameter is empty, import all files. Timestamp format: yyyy-MM-dd'T'HH:mm:ss.SSSZ E.g. 2019-05-01T10:00:00.121Z
Incremental? When running repeatedly, attempt to import files only since the last import. See How Incremental Loading works.
You can see a preview of your data before running the import by selecting Generate Preview. Data shown in the data preview is approximated from your source. It is not the actual data that is imported. Click Next. To preview your data, select Generate Preview. Optionally, click Next. Verify that the data looks approximately like you expect it to. Select Next.
Data preview is optional and you can safely skip to the next page of the dialog if you want.
Data shown in the data preview is approximated from your source. It is not the actual data that is imported.
To preview your data, select Generate Preview. Optionally, click Next.
Verify that the data looks approximately like you expect it to.
Advanced Settings allow you to customize the guessed properties. Edit the following section, if needed.
Decoders: You can modify the guessed file decoder depends on the encoded files stored on OneDrive. Supported file decoders:
Parser: Update the File Parser if needed.
Name: Changes the name of the column. Supported characters for column names are lowercase alphabets, numbers, and “_” (underscore) only.
Type: Parses a value as a specified type and stores the type as part of the Treasure Data schema.
timestamp: imported as String type at Treasure Data (Ex. 2017-04-01 00:00:00.000)
Retry Limit: Optional. A number of retries attempt when an error occurs before the system gives up. Default 7
Initial retry time wait in millis: Optional. The time, in milliseconds, between the first and second attempts. Default: 500 milliseconds
Max retry wait in millis: Optional. The time, in milliseconds, between the second and all subsequent attempts. Default: 300000 ~ 5 minutes.
HTTP Connect Timeout: Optional. The time, in milliseconds, HTTP connect timeout. Default: 60000 ~ 1 minute.
HTTP Read Timeout: Optional, The time, in milliseconds, HTTP read timeout. Default: 300000 ~ 5 minutes.
Minimum task size: When the import folder contains too many small files, it would cause degradation in import performance. Set this value to combine multiple files per task for performance gaining. E.g. 4000000 ~ 4MB
Choose the Target Database and Table
Choose an existing or create a new database and table.
Create a new database and give your database a name. Complete similar steps for Create new table.
Select whether to append records to an existing table or replace your existing table.
If you want to set a different partition key seed rather than use the default key, you can specify one using the popup menu.
In the Schedule tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, select Start Transfer. If you select Repeat… specify your schedule options, then select Schedule Transfer.
Enter the New Source name and select DONE
After your transfer has run, you can see the results of your transfer in the Databases tab.
Use Command-Line to Create Your OneDrive Connection
You can use the TD Console to configure your connection.
Install the Treasure Data Toolbelt
Open a terminal and run the following command to install the newest TD Toolbelt.
Create a Configuration File (seed.yml)
Using a text editor, create a file called seed.yml. Copy and paste the following information replacing the placeholder text with your OneDrive. Note that this configuration dumps all files content from /xxx folder in the target database because “replace” mode is specified. For more details on available out modes, see the Appendix.
If your account is a Business account plan
If your account is a Personal account plan
refresh_token for authenticating with OneDrive. For more information see Appendix. Get your own refresh_token.
Specify the domain_name, server_relative_path, and folder_path. See Appendix where to Domain Name and Relative Path
Run the Guess Fields Command (Generate a load.yml File)
Use the guess command if you are pulling in data from a schema-less database. Run the following command from your terminal:
connector:guess automatically reads the target data and intelligently guesses the data format.
Open the file load.yml. You’ll see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.
Configuration keys and descriptions are as follows:
OneDrive account plan. business | personal
See Get your own refresh_token.
applicable for business accout_type only. When true, imports files directly shared with you (different from shared folders)
applicable for personal account_type only. importing files/folders shared with you.
required when account_type = business.
required when account_type = business.
required when is_shared_root = false
Only import files modified after the specified timestamp. Timezone is UTC. If this parameter is empty, import all files. Timestamp format: yyyy-MM-dd'T'HH:mm:ss.SSSZ E.g. 2019-05-01T10:00:00.121Z
Only import files that match the provided filename regex. See Filename Regex Matching
When running repeatedly, attempts to import files only since the last import.
Optional. A number of retries attempt when an error occurs before the system gives up. Default 7 retries
The time, in milliseconds, between the first and second attempt. Default: 500 milliseconds
The time, in milliseconds, between the second and all subsequent attempts. Default: 300000 ~ 5 minutes
The time, in milliseconds, HTTP connect timeout. Default: 60000 ~ 1 minute
The time, in milliseconds, HTTP read timeout. Default: 300000 ~ 5 minutes.
When the import folder contains too many small files, it would cause degradation in import performance. Set this value to combine multiple files per task for performance gaining. E.g. 4000000 ~ 4MB
An example of load.yml with personal account plan, incremental, and column_options.
Optionally Preview the Data to be Imported
You can preview how the system will parse your data by using a preview command.
guess command needs over 3 rows and 2 columns in source data file, because it guesses column definition using sample rows from source data.
If the system detects your column name or column type unexpectedly, modify `load.yml` directly and runs preview again.
The Data Connector supports parsing of "boolean", "long", "double", "string", and "timestamp" types.
preview command downloads one file from the specified bucket and displays the results from that file. This may cause a difference in results from the preview and issue commands.
Execute the Load Job
You use td connector:issue to execute the job. The following are required:
name of the schedule
database and table where their data will be stored
the Data Connector configuration file
If the database or the table do not exist in TD, td connector:issue will fail.
It’s also recommended to specify --time-column option, because Treasure Data’s storage is partitioned by time (see data partitioning). If the --time-column option is not available, the data connector selects the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either of long or timestamp type (use Preview results to check for the available column name and type). A time column is available at the end of the output.
If your data doesn’t have a time column you can add it using the add_time filter. You add the "time" column by adding the add_time filter to your configuration file as follows.
More details at add_time filter plugin.
The integration does not sort records server-side. To use time-based partitioning effectively, sort records in the files before running the import job.
If you have a field called time, you don’t need to specify the --time-column option.
You can schedule periodic data connector execution for periodic OneDrive import. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.
For the scheduled import, the data connector for OneDrive imports all objects that match the param name_match_pattern. The incremental, and last_modified_time params used to import files get updated, or new files added to folder_path since the last import only. See How Incremental Loading works.
Here’s an example of a seed file using incremental mode combined with ‘append’ mode for output.
Create the Schedule
A new schedule can be created using the td connector:create command. The following are required:
name of the schedule
database and table where their data will be stored
Data Connector configuration file
Remember to specify the --time-column option, because Treasure Data’s storage is usually partitioned by time.
The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.
By default, schedule is setup in UTC timezone. You can set the schedule in a timezone using -t or --timezone option. `--timezone` option only supports extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles' etc. Timezone abbreviations like PST, CST are *not* supported and may lead to unexpected schedules.
List the Schedules
You can see the list of scheduled entries by running the command td connector:list.
Show the Settings and Schedule History
td connector:show shows the execution settings of a schedule entry.
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.
Delete the Schedule
td connector:delete removes the schedule.
FAQ for Import from OneDrive
Q: How do I import a folder that someone shared with me?
Business Account Plan:
You need the domain name, server relative path, and folder path. Go to the Shared With Me tab on the browser:
Select the Shared folder and make note of the domain name, relative path, and folder path as described in Domain Name and Relative Path.
Personal Account Plan
You must set the field: This is a shared folder (is_shared_folder) to true, and drill down to the shared folder as shown in the image.
Select the shared folder f1 and navigate to the f2 target folder to import.
Input the folder path param as shown: folder_path: /f1/f2 (note: folder f0 is not shared with you)
Q: How can I import files from SharePoint collections?
You need a Domain Name, server Relative Path, and Folder Path to import files from SharePoint collections. See Domain Name and Relative Path
Q: Why I got an error "Attribute type is required but not set"?
Your import folder contains empty files or your files do not have a line break character. Delete the empty files, and make sure import files have at least 2 lines of data.
Q: We need to obtain approval from OneDrive admin user for accessing the app. After obtained approval from the admin, the oatuh connection doesn't show up on the connection list
This is because our connector doesn't support such OAuth flow (requesting approval with OAuth). Thus please ask your OneDrive admin user to create authentication on behalf of you.
Modes for the Out Plugin
You can specify file import mode in the out section of the load.yml file.
The out section controls how data is imported into a Treasure Data table.
For example, you may choose to append data or replace data in an existing table in Treasure Data.
Output modes are ways to modify the data as the data is placed in Treasure Data.
Append (default): Records are appended to the target table.
Replace (available In td 0.11.10 and later): Replaces data in the target table. Any manual schema changes made to the target table remain intact.