# Databricks Export Integration

You can write job results directly to Databricks. For example, you might integrate data from multiple sources into Treasure Data, run queries to organize the data, and then use a job that applies the connector to export the data to Databricks.

For import integration, see [Databricks Import Integration](/int/databricks-import-integration).

## Prerequisites

- Basic knowledge of Treasure Data, including the [TD Toolbelt](https://toolbelt.treasuredata.com/)
- Basic knowledge of Databricks and access to a Databricks account
- Access to a Databricks server

## Requirements and Limitations

- If the Databricks server is in hibernation mode, it can take a few minutes to start, which may cause the connector to time out.
- Databricks doesn't support transactions across multiple statements, so the replace and truncate modes can fail partially and cannot roll back. For example, in truncate or replace mode, if the delete or create-replace operation succeeds but inserting data from the temporary table into the target table fails, the connector cannot roll back to the previous data.

## Use the TD Console to Create your Connection

You can use the Treasure Data Console to configure your connection.

### Obtaining the Host Name and HTTP Path for the Databricks Connector

1. Log in to the Databricks website.
2. Navigate to the SQL tab and then select SQL Warehouses.
![](/assets/databricks-export-integration-2024-02-21-3.90cecd1408ee55f95aa56d2175e3afd6ce827e8634cab70f3c8cdd4a97d42f88.a33e2d6c.png)
3. Select an instance of an SQL Warehouse.
![](/assets/databricks-export-integration-2024-02-21-9.1224b9f2be7bed1529fd3e0a6f88f7af06a3f1fd03402079fb5d667c03fb4f49.a33e2d6c.png)
4. Select the Connection details tab.
5. Note the Server hostname and HTTP path for the SQL Warehouse.
![](/assets/databricks-export-integration-2024-02-21-10.a2aee3545eae631d1fb5258bb94502ca35163d216c0c65374ddce0c591a8fec8.a33e2d6c.png)

### Create a new connection

Go to [Treasure Data Connections](https://console.treasuredata.com/app/connections/data-sources).

1. Select **Integrations Hub** > **Catalog**.
2. Search for Databricks in the catalog.
3. Hover your mouse over the icon and select **Create Authentication**.
4. Ensure that the Credentials tab is selected, and then enter credential information for the integration.
![](/assets/databricks-export-integration-2024-02-21-1.04e2663633e3ecf2111d90be1e4cdc0fce8e995a98ad680af4cabb8aa03e7c28.a33e2d6c.png)

| Parameter | Description |
| --- | --- |
| Hostname | The hostname of the Databricks instance. |
| HTTP Path | The HTTP path of the Databricks instance. |
| Authentication Method | The authentication method: either Basic or Personal Access Token. |
| Username | The username used to log in. Only valid when the Authentication Method is Basic. |
| Password | The password used to log in. Only valid when the Authentication Method is Basic. |
| Token | The Databricks personal access token. Only valid when the Authentication Method is Personal Access Token. |

5. Select **Continue**.
6. Enter a name for your authentication, and select **Done**.

### Configure the Connection by Specifying the Parameters

Create or reuse a query that configures the data connection.

1. Go to the [Treasure Data console](https://console.treasuredata.com/users/sign_in).
2. Go to the [Query Editor](https://console.treasuredata.com/app/queries/editor).
3. Access the query that you plan to use to export data.
Sometimes, you need to define the column mapping in the query. For example:

```sql
SELECT c_id, c_double, c_long, c_string, c_boolean, c_timestamp, c_json
FROM (VALUES
  (1, 100, 10, 'T4', true, '2018-01-01', '{ "name":"John"}'),
  (2, 100, 99, 'P@#4', false, '2018-01-01', '{ "name":"John"}'),
  (3, 100.1234, 22, 'C!%^&*', false, '2018-01-01', '{ "name":"John"}')
) tbl1 (c_id, c_double, c_long, c_string, c_boolean, c_timestamp, c_json)
```

Select **Output Results**, located at the top of your query editor.

### Configure the connection by specifying the parameters

1. Type the connection name in the search box to filter and select your connection.
2. After you select your Databricks connection, the Configuration or Export Results dialog pane appears:

![](/assets/databricks-export-integration-2024-02-21-2.258c59c45384d358126b1d3a14c47dac59e22c56e54dcc11d96aa78067829cc4.a33e2d6c.png)

Specify the parameters:

- Catalog (required): The dataset catalog.
- Schema (required): The dataset schema.
- Table (required): The table to which results will be exported. If it does not exist, a new one will be created.
- Mode (required): There are three modes for writing results to the Databricks table: Append, Overwrite, and Replace.
  - Append: TD results will be appended to the target table.
  - Overwrite: The target table will be erased, then TD results will be appended from the beginning of the target table.
  - Replace: The target table will be re-created, then TD results will be appended from the beginning of the target table.
- Data type mapping: Explained in the following section.

### Data type mapping

The following table shows the default one-to-one mapping from the type in TD results to the type in the target table when the target table does not exist:

| TD results | Databricks |
| --- | --- |
| string | STRING |
| double | DOUBLE |
| long | BIGINT |
| timestamp | TIMESTAMP |
| boolean | BOOLEAN |
| json | STRING |

You might want to set a type different from the default. Data Type Mapping explicitly assigns a specific type (for example, STRING) to a specific column, and it applies only to columns in your target table. If you use Data Type Mapping and the target table does not exist, the export job creates a new target table.

The syntax for the data type mapping parameter is `col_name_1: STRING; col_name_2: BIGINT`. You must provide the column name and the Databricks data type. TD Databricks Export doesn't support the BINARY, INTERVAL, STRUCTURE, ARRAY, and MAP data types.

For example, in the query:

```sql
SELECT c_id, c_double, c_long, c_string, c_boolean, c_timestamp, c_json
FROM (VALUES
  (1, 100, 10, 'T4', true, '2018-01-01', '{ "name":"John"}'),
  (2, 100, 99, 'P@#4', false, '2018-01-01', '{ "name":"John"}'),
  (3, 100.1234, 22, 'C!%^&*', false, '2018-01-01', '{ "name":"John"}')
) tbl1 (c_id, c_double, c_long, c_string, c_boolean, c_timestamp, c_json)
```

you could set the data type mapping parameter to `c_timestamp: DATE` to store the `c_timestamp` column as DATE instead of the default TIMESTAMP.

## Optional: Use of Scheduled Jobs for Output

You can use [Scheduled Jobs](http://docs.treasuredata.com/display/PD/Scheduling+Jobs+Using+TD+Console) with Result Output to periodically write the output result to a specific target destination.

## Optional: Configure Export Results in Workflow

Within Treasure Workflow, you can specify the use of this data connector to output data.
```yaml
timezone: UTC

_export:
  td:
    database: sample_datasets

+td-result-into-databricks:
  td>: queries/sample.sql
  result_connection: your_connection_name
  result_settings:
    catalog: CATALOG
    schema: default
    table: TARGET_TABLE
    mode: insert
    column_options:
      col1: {type: 'DATE'}
```

## Use the CLI to create your connection

### Install 'td' command

Install the [TD Toolbelt](https://toolbelt.treasuredata.com/).

### For On-demand Jobs

Add the Databricks result output destination by using the `-r` / `--result` option for the `td query` command:

```bash
td query -d mydb \
  -w 'SELECT id, name FROM source_table' \
  --type presto \
  -r '{"type": "databricks", "host_name": "hostname", "http_path": "httpPath", "auth_method": "BASIC", "username": "username", "password": "password", "catalog": "catalog", "schema": "default", "table": "target_table", "mode": "replace", "column_options": {"col1": {"type": "DATE"}}}'
```

Some parameters, such as catalog and schema, are self-explanatory and are the same parameters used in the TD Console (described in the "Configure the connection by specifying the parameters" section). However, some parameters differ in either the key or the value:

- mode (required): The value for mode is the raw value of the Mode parameter:
  - insert: TD results will be appended to the target table.
  - truncate_insert: The target table will be erased, then TD results will be appended from the beginning of the target table.
  - replace: The target table will be re-created, then TD results will be appended from the beginning of the target table.
- column_options: The data type mapping.

### For Scheduled Jobs

Add the Databricks result output destination by using the `-r` / `--result` option for the `td sched:create` command:

```bash
td sched:create every_6_mins "*/6 * * * *" \
  -d mydb \
  -w 'SELECT id, name FROM source_table' \
  --type presto \
  -r '{"type": "databricks", "host_name": "hostname", "http_path": "httpPath", "auth_method": "BASIC", "username": "username", "password": "password", "catalog": "catalog", "schema": "default", "table": "target_table", "mode": "replace", "column_options": {"col1": {"type": "DATE"}}}'
```
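
The `column_options` object in the `-r` JSON plays the same role as the Data Type Mapping string in the TD Console (`col_name_1: STRING; col_name_2: BIGINT`). The following is a minimal sketch of an on-demand export that maps several columns explicitly; the hostname, HTTP path, credentials, database, and `my_source_table` are placeholder values, and the column names and types are taken from the sample query and mapping table earlier in this page rather than from a real schema:

```bash
# Sketch only: explicit data type mapping for several columns via column_options.
# hostname, httpPath, username, password, mydb, my_source_table, and target_table are placeholders.
td query -d mydb \
  -w 'SELECT c_id, c_double, c_long, c_string, c_boolean, c_timestamp, c_json FROM my_source_table' \
  --type presto \
  -r '{"type": "databricks", "host_name": "hostname", "http_path": "httpPath", "auth_method": "BASIC", "username": "username", "password": "password", "catalog": "catalog", "schema": "default", "table": "target_table", "mode": "insert", "column_options": {"c_long": {"type": "BIGINT"}, "c_string": {"type": "STRING"}, "c_timestamp": {"type": "DATE"}}}'
```

Columns that are not listed in `column_options` fall back to the default mapping shown in the data type mapping table.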