# TD Python Spark Driver With Databricks

Treasure Data will no longer accept new users for the Plazma Public API, which is used by the td-pyspark driver. Use the [pytd library](https://api-docs.treasuredata.com/en/tools/pytd/quickstart/) for the integration instead.

You can use td-pyspark to bridge the results of data manipulations in Databricks with your data in Treasure Data. Databricks builds on top of Apache Spark, providing an easy-to-use interface for accessing Spark. PySpark is a Python API for Spark. Treasure Data's [td-pyspark](https://pypi.org/project/td-pyspark/) is a Python library that provides a handy way to use PySpark and Treasure Data, based on td-spark.

## Prerequisites

To follow the steps in this example, you must have the following items:

* Treasure Data API key
* td-spark feature enabled

## Configuring your Databricks Environment

You create a cluster, install the td-pyspark libraries, and configure a notebook for your connection code.

### Create a Cluster on Databricks

1. Select the Cluster icon.
2. Select **Create Cluster**.
3. Provide a cluster name, select Spark 2.4.3 or later as the **Databricks Runtime Version**, and select 3 as the **Python Version**.

![](/assets/image1.3ce73e62942f1880b3459942644bd983d9878889e8521488b81e64119261fe0b.60bcc915.png)

### Install the td-pyspark Libraries

Access the Treasure Data Apache Spark Driver Release Notes for additional information and the most current download, or select one of the links below.

1. Select a jar file to download:
   * [td-spark-assembly-latest_spark2.4.7.jar](https://td-spark.s3.amazonaws.com/td-spark-assembly-latest_spark2.4.7.jar) (Spark 2.4.7, Scala 2.11)
   * [td-spark-assembly-latest_spark3.0.1.jar](https://td-spark.s3.amazonaws.com/td-spark-assembly-latest_spark3.0.1.jar) (Spark 3.0.1, Scala 2.12)
2. Select PyPI.

![](/assets/image2.647afd229c47d4405e59ecb8623bff5fdc5649570e12e69390a7f1d718f19db6.60bcc915.png)

When the download completes, you see the following:

![](/assets/image3.020caa5ed781bcc7f9d2285ddaf581fb05443fcc4c63972de9e250fcc5702e6a.60bcc915.png)

### Specify your TD API Key and Site

In the Spark configuration, specify your Treasure Data API key and the other required properties.

![](/assets/image4.113acee2e5ea1af141dfeff9f21be666c63c136063f9b110f8723e1e4951cd38.60bcc915.png)

An example of the format is as follows. You provide the actual values:

```
spark.td.apikey (Your TD API KEY)
spark.td.site (Your site: us, jp, eu01, ap02)
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true
```

### Restart Cluster and Begin Work in Databricks

1. Restart your Spark cluster.
2. Create a notebook.
3. Create a script similar to the following code:

```python
%python
from pyspark.sql import *
import td_pyspark

SAMPLE_TABLE = "sample_datasets.www_access"

td = td_pyspark.TDSparkContext(spark)
df = td.table(SAMPLE_TABLE).within("-10y").df()
df.show()
```

*TDSparkContext* is the entry point for accessing td_pyspark's functionality. As shown in the preceding code sample, you create it by passing your SparkSession (`spark`) to TDSparkContext:

```python
td = td_pyspark.TDSparkContext(spark)
```

You see a result similar to the following:

![](/assets/image5.db3371232b119ee3a0a177fa29e4e07deb8524e4da329f53e8d9085321ca3442.60bcc915.png)

Your connection is working.

## Interacting with Treasure Data from Databricks

In Databricks, you can run select and insert queries against Treasure Data and query data back from it. You can also create and delete databases and tables.
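The commands are described individually in the sections below. As a preview, a minimal round trip from your notebook might look like the following sketch. It assumes your API key has write permission and uses a hypothetical destination database and table (`mydb.www_access_by_method`):

```python
import td_pyspark

td = td_pyspark.TDSparkContext(spark)

# Read a Treasure Data table into a Spark DataFrame
df = td.table("sample_datasets.www_access").within("-10y").df()

# Aggregate with standard PySpark operations
summary = df.groupBy("method").count()

# Write the result back to Treasure Data
# (mydb and www_access_by_method are hypothetical names for this example)
td.create_database_if_not_exists("mydb")
td.create_or_replace(summary, "mydb.www_access_by_method")
```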
In Databricks, you can use the following commands.

#### **Read Tables as DataFrames**

To read a table, use td.table(table_name):

```python
df = td.table("sample_datasets.www_access").df()
df.show()
```

#### **Change the Database Used in Treasure Data**

To change the context database, use td.use(database_name):

```python
td.use("sample_datasets")
# Accesses sample_datasets.www_access
df = td.table("www_access").df()
```

By calling .df(), your table data is read as a Spark DataFrame. Usage of the DataFrame is the same as in PySpark. See also the [PySpark DataFrame documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.md).

#### **Access sample_datasets.www_access**

With the context database set to sample_datasets, you can refer to the table by name alone:

```python
df = td.table("www_access").df()
```

#### **Submit Presto Queries**

If your Spark cluster is small, reading all of the data as an in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processed by PySpark.

```python
q = td.presto("select code, * from sample_datasets.www_access")
q.show()

q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
q.show()
```

You see:

![](/assets/image6.c82e5076cc37c73e25ba59a89c8747e01a573350bd4081208515046f1217216f.60bcc915.png)

![](/assets/image7.858584a0970d59cdcc6ce9ee5f034a853256537e5b205785ce9f8cf40e76538b.60bcc915.png)

#### **Create or Drop a Database**

```python
td.create_database_if_not_exists("mydb")
td.drop_database_if_exists("mydb")
```

#### **Upload DataFrames to Treasure Data**

To save your local DataFrames as a table, you have two options:

* Insert the records in the input DataFrame into the target table
* Create or replace the target table with the content of the input DataFrame

```python
td.insert_into(df, "mydb.tbl1")
td.create_or_replace(df, "mydb.tbl2")
```

## Checking Databricks in Treasure Data

You can use TD Toolbelt to check your database from the command line. Alternatively, if you have TD Console, you can check your databases and queries there. Read about [Database and Table Management](/products/customer-data-platform/data-workbench/databases/data-management).
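You can also run a quick sanity check from the Databricks notebook itself before switching to TD Console. For example, assuming you uploaded `mydb.tbl1` as shown in the previous section, a Presto count confirms the table is visible in Treasure Data:

```python
# Confirm the uploaded table is visible from Treasure Data
# (mydb.tbl1 is the example table created in the upload step above)
q = td.presto("select count(*) from mydb.tbl1")
q.show()
```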