...

Amazon EMR is an AWS tool for big data processing and analysis that provides an easy-to-use interface for accessing Spark. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library, built on td-spark, that provides a handy way to use PySpark with Treasure Data.

Prerequisites

...

When you create the key pair in Amazon, you provide a name, and a file with the .pem extension is generated. Download the generated file to your local computer.
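SSH clients generally refuse a private key file that other users can read, so it can help to restrict the downloaded .pem file before you connect to the cluster. The following is a minimal sketch, assuming a POSIX system; the helper name restrict_key_permissions and the file name my-emr-key.pem are hypothetical and only illustrate the idea.

```python
import os
import stat
from pathlib import Path


def restrict_key_permissions(pem_path):
    """Make a downloaded key pair file readable by its owner only (mode 400).

    Hypothetical helper for illustration; pem_path is whatever file
    you downloaded when creating the key pair.
    """
    path = Path(pem_path).expanduser()
    if path.suffix != ".pem":
        raise ValueError(f"expected a .pem file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(path)
    # Owner read-only: no write bit, no access for group or others.
    os.chmod(path, stat.S_IRUSR)
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(path.stat().st_mode)


if __name__ == "__main__":
    # "my-emr-key.pem" is a placeholder name, not a value from this guide.
    key_file = Path("~/Downloads/my-emr-key.pem").expanduser()
    if key_file.exists():
        print(oct(restrict_key_permissions(key_file)))
```

Running the equivalent chmod command by hand works just as well; the function form is only convenient if you script your cluster setup.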

...

Complete the configuration fields: provide a cluster name and a folder location for the cluster data, and select Spark 2.4.3 or later as the Application.

...

While still in the Master node instance, run the following command to install pyspark:

...

Create a Configuration File and Specify your TD API Key and Site

In the Master node instance, create a td-spark.conf file. In the configuration file, specify your TD API Key, TD site parameters, and Spark environment.

An example of the format is as follows. You provide the actual values:

...
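If you would rather generate the file from a script than edit it by hand, the sketch below shows one possible approach. It assumes that td-spark reads the property names spark.td.apikey and spark.td.site and that the Kryo serializer is a sensible Spark setting here; check these against the td-spark documentation before relying on them. The function names and the TD_API_KEY and TD_SITE environment variables are hypothetical.

```python
import os
from pathlib import Path


def render_td_spark_conf(api_key, site="us"):
    """Return the text of a td-spark.conf file.

    The property names below are assumptions based on td-spark
    conventions; confirm them against the td-spark documentation.
    """
    if not api_key or any(ch.isspace() for ch in api_key):
        raise ValueError("api_key must be a non-empty string without whitespace")
    lines = [
        f"spark.td.apikey {api_key}",
        f"spark.td.site {site}",
        "spark.serializer org.apache.spark.serializer.KryoSerializer",
    ]
    return "\n".join(lines) + "\n"


def write_td_spark_conf(path, api_key, site="us"):
    """Write the configuration and restrict it to the owner (it holds a secret)."""
    conf_path = Path(path)
    conf_path.write_text(render_td_spark_conf(api_key, site))
    conf_path.chmod(0o600)
    return conf_path


if __name__ == "__main__":
    # TD_API_KEY and TD_SITE are hypothetical variable names for this sketch.
    key = os.environ.get("TD_API_KEY")
    if key:
        write_td_spark_conf("td-spark.conf", key, os.environ.get("TD_SITE", "us"))
```

Reading the API key from the environment keeps it out of shell history and source control, which is the main reason to script this step at all.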

Checking Amazon EMR in Treasure Data

You can use TD Toolbelt to check your database from the command line. Alternatively, if you have TD Console, you can check your databases and queries there. Read about Database and Table Management.