Apache Spark Driver on Amazon EMR (Private Alpha)

This article explains how to use Treasure Data’s Apache Spark Driver on Amazon Elastic MapReduce (EMR).

This feature is in the alpha stage, and access is disabled by default. We are looking for customers who know Apache Spark well and are willing to try this feature and give feedback to our team. If you are interested, please contact product@treasure-data.com.


Create an EMR Spark cluster

  • Create an EMR cluster with Spark support. We highly recommend the us-east region to maximize data transfer performance from S3.

  • Check the master node address of the new EMR cluster.

If you created the EMR cluster with the default security group (ElasticMapReduce-master), make sure to permit inbound access from your environment. For details, refer to “Amazon EMR-Managed Security Groups”.
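
The inbound rule described above can also be added from the command line. The following is a minimal sketch, assuming that the AWS CLI is installed and configured, that the security group lives in a default VPC (otherwise pass --group-id instead of --group-name), and that you want to open only SSH (port 22) to a single address. The helper name ssh_ingress_cmd and the example address are hypothetical. The helper only prints the command so that you can review it before running it.

```shell
#!/bin/sh
# Print (do not run) an AWS CLI command that permits inbound SSH (port 22)
# from one address to the default EMR master security group.
# Assumption: the group is in a default VPC, so --group-name is accepted.
ssh_ingress_cmd() {
  my_ip="$1"   # your public IP address, for example 203.0.113.10
  printf 'aws ec2 authorize-security-group-ingress --group-name %s --protocol tcp --port 22 --cidr %s/32\n' \
    "ElasticMapReduce-master" "$my_ip"
}

# Example: review the printed command, then run it yourself.
# ssh_ingress_cmd 203.0.113.10
```

Restricting the rule to a /32 address keeps the master node closed to everyone except your own environment.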

Other references

Log In to the EMR Cluster

Connect to the EMR master node with SSH:

# Use 8157 as the SOCKS5 proxy port so that you can access the EMR Spark job history page (port 18080), the Zeppelin notebook (port 8890), etc.
$ ssh -i (your AWS key pair file. .pem) -D8157 hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
     __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/
4 package(s) needed for security, out of 6 available
Run "sudo yum update" to apply all updates.

EEEEEEEEEEEEEEEEEEEE MMMMMMMM           MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M         M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M       M::::::::M R:::::RRRRRR:::::R
  E::::E       EEEEE M:::::::::M     M:::::::::M RR::::R      R::::R
  E::::E             M::::::M:::M   M:::M::::::M   R:::R      R::::R
  E:::::EEEEEEEEEE   M:::::M M:::M M:::M M:::::M   R:::RRRRRR:::::R
  E::::::::::::::E   M:::::M  M:::M:::M  M:::::M   R:::::::::::RR
  E:::::EEEEEEEEEE   M:::::M   M:::::M   M:::::M   R:::RRRRRR::::R
  E::::E             M:::::M    M:::M    M:::::M   R:::R      R::::R
  E::::E       EEEEE M:::::M     MMM     M:::::M   R:::R      R::::R
EE:::::EEEEEEEE::::E M:::::M             M:::::M   R:::R      R::::R
E::::::::::::::::::E M:::::M             M:::::M RR::::R      R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM             MMMMMMM RRRRRRR      RRRRRR

Set Up TD Spark Integration

Download the td-spark jar file:

[hadoop@ip-x-x-x-x]$ wget https://s3.amazonaws.com/td-spark/td-spark-assembly-0.1.jar

Create a td.conf file on the master node:

# Describe your TD API key here
spark.td.apikey=(your TD API key)
# (Recommended) Use KryoSerializer for faster performance
spark.serializer=org.apache.spark.serializer.KryoSerializer
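
The td.conf file above can also be generated from a script. This is a minimal sketch, assuming that the API key is supplied in an environment variable named TD_API_KEY (a hypothetical name, not something td-spark reads itself). The helper name write_td_conf is also hypothetical. It writes the same two properties shown above and restricts the file to its owner, since it contains a secret.

```shell
#!/bin/sh
# Write td.conf with the two settings from this article.
# Assumption: the API key is provided in the TD_API_KEY environment variable.
write_td_conf() {
  conf_path="${1:-td.conf}"
  if [ -z "${TD_API_KEY:-}" ]; then
    echo "TD_API_KEY is not set" >&2
    return 1
  fi
  cat > "$conf_path" <<EOF
spark.td.apikey=${TD_API_KEY}
spark.serializer=org.apache.spark.serializer.KryoSerializer
EOF
  # Restrict the file to its owner, since it contains a secret.
  chmod 600 "$conf_path"
}

# Example:
# export TD_API_KEY="(your TD API key)"
# write_td_conf td.conf
```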

Using spark-shell on EMR

[hadoop@ip-x-x-x-x]$ spark-shell --master yarn --jars td-spark-assembly-0.1.jar --properties-file td.conf
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
scala> import com.treasuredata.spark._
scala> val td = spark.td
scala> val d = td.df("sample_datasets.www_access")
scala> d.show
+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+
|user|           host|                path|             referer|code|               agent|size|method|      time|
+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+
|null|136.162.131.221|    /category/health|   /category/cameras| 200|Mozilla/5.0 (Wind...|  77|   GET|1412373596|
|null| 172.33.129.134|      /category/toys|   /item/office/4216| 200|Mozilla/5.0 (comp...| 115|   GET|1412373585|
|null| 220.192.77.135|  /category/software|                   -| 200|Mozilla/5.0 (comp...| 116|   GET|1412373574|
+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+
only showing top 3 rows
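
The interactive session above can also be run as a batch. This is a minimal sketch, assuming that td-spark-assembly-0.1.jar and td.conf are in the current directory on the master node. The helper name write_sample_script and the file name www_access.scala are hypothetical. The Scala statements are the same ones typed into the shell above.

```shell
#!/bin/sh
# Write the Scala statements from the spark-shell session to a script file.
write_sample_script() {
  script_path="${1:-www_access.scala}"
  cat > "$script_path" <<'EOF'
import com.treasuredata.spark._
val td = spark.td
val d = td.df("sample_datasets.www_access")
d.show
EOF
}

# Example, on the EMR master node. spark-shell reads the statements from
# standard input and exits when it reaches the end of the file:
# write_sample_script www_access.scala
# spark-shell --master yarn --jars td-spark-assembly-0.1.jar \
#   --properties-file td.conf < www_access.scala
```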

Using Zeppelin Notebook on EMR

Configure Zeppelin for td-spark

Create an SSH tunnel to the EMR cluster:

$ ssh -i (your AWS key pair file. .pem) -D8157 hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
  • (For Chrome users) Install the Proxy SwitchySharp Chrome extension.
    • Turn on the proxy switch for EMR when accessing your EMR master node.
  • Open http://(your EMR master node public address):8890/
  • Configure td-spark on the Interpreters page.

Access Dataset in TD as DataFrame

  • Read table data as a Spark DataFrame.

Running Presto Queries

[Figure: Presto]

Checking Spark History Server

  • Open http://(your EMR master node public address):18080/
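
The history server can also be checked without a browser. This is a minimal sketch, assuming that the SSH tunnel with -D8157 from the earlier step is open and that curl is installed on your machine. It queries the Spark History Server REST endpoint /api/v1/applications, which returns the list of applications as JSON. The helper names history_api_url and list_spark_apps are hypothetical.

```shell
#!/bin/sh
# Build the History Server REST URL for a given master node address.
history_api_url() {
  printf 'http://%s:18080/api/v1/applications\n' "$1"
}

# Query the History Server through the SOCKS5 proxy opened by `ssh -D8157`.
# --socks5-hostname makes curl resolve the host name on the proxy side.
list_spark_apps() {
  curl --silent --fail --socks5-hostname localhost:8157 "$(history_api_url "$1")"
}

# Example (requires the SSH tunnel to be open):
# list_spark_apps ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
```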

Last modified: Jan 11 2017 08:15:07 UTC
