As part of a recent HumanGeo effort, I was faced with the challenge of detecting patterns and anomalies in large geospatial datasets using various statistics and machine learning methods. Given the size of the datasets, the speed at which they needed to be processed, and other project constraints, I knew I had to develop a scalable solution that could easily be deployed to AWS. I preferred to use Apache Spark, given my own and HumanGeo's positive experiences with it. In addition, we needed to develop a solution quickly, so naturally I turned to Python 3.4. It was already part of our tech stack, and let's be real, Python makes life easier (for the most part). Given these requirements, my quest to discover the best solution quickly led me to Amazon's Elastic MapReduce (EMR) service.
For those who aren't familiar with EMR, it's basically a scalable, Hadoop-based Amazon web service that allows users to quickly and easily launch and manage preconfigured Elastic Compute Cloud (EC2) instances. EMR also acts as an interface to the user-selected open-source analytics frameworks installed on its clusters, making them quick and easy to manage. The icing on the cake was that EMR can be preconfigured to run Spark on launch, and Spark jobs can be written in Python. The process of creating my Spark jobs, setting up EMR, and running my jobs was easy…until I hit a few major snags, mostly due to using Python 3.4. Whomp, whomp. Fortunately, I was able to solve these problems. This blog post is dedicated to those who are encountering the same, or similar, problems.
Obstacle 1: Installing Python dependencies on all nodes
The first obstacle that I had to overcome was that I needed to install the Python dependencies for the job. Since Spark is a distributed environment, each node of the EMR cluster needs to have all the dependencies required to run its jobs. This is usually handled by distributing the job's *.jar-with-dependencies (i.e., jobs written in Java). However, Python jobs don't have a jar-with-dependencies, so you have to work around this. Luckily, you can use EMR's bootstrap scripting to do this. The bootstrap script is run on each node of your EMR cluster after it's provisioned. It can be used to tweak the stock EMR environment to fit your application's needs, including installing additional dependencies and modifying configuration. In my case, I used the bootstrap script to install the Python dependencies system-wide with pip. See the code snippet below.
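The script looked roughly like the sketch below; the pip-3.4 binary name and the packages shown are assumptions for illustration, so substitute your cluster's pip for Python 3.4 and your job's actual dependencies.

```bash
#!/bin/bash
# EMR bootstrap action: runs on every node after it is provisioned.
# Install the job's Python dependencies system-wide for Python 3.4.
# (The pip-3.4 binary name and the package list are illustrative assumptions.)
set -e
sudo pip-3.4 install numpy requests
```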
This can also be accomplished by using spark-submit's --py-files argument, where you can supply a zip of Python dependencies, including .py files and .eggs. However, there seems to be a consensus that this approach is messy and unreliable.
Obstacle 2: Configuring EMR with Python 3.4
I found this blog post detailing how to run EMR with Python 3.4. It provides a JSON configuration that basically exports an environment variable that PySpark will use to determine the version of Python under which it will run. Note that this is sufficient for using PySpark on EMR, but spark-submit will require additional configuration.
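The configuration boils down to something like the sketch below, which sets PYSPARK_PYTHON via EMR's spark-env classification (the python34 value is an assumption about how the EMR image names its Python 3.4 interpreter):

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "python34"
        }
      }
    ]
  }
]
```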
Obstacle 3: Copying job config to all nodes
Since the Spark job needs a configuration file, that file needs to be present on all of the nodes in the cluster. The AWS CLI provides a command that copies a file from one location to another, including from S3 to the local filesystem. In my case, I needed to copy a file from S3 to all of my EMR nodes, so I placed the copy command in my bootstrap script. Since the bootstrap script runs on every node, the config file was copied from S3 to each node in the cluster.
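In the bootstrap script, that copy looks something like this (the bucket, key, and destination path are hypothetical):

```bash
# Copy the job's config file from S3 onto this node. This runs on every node,
# since it lives in the bootstrap script. Names here are hypothetical.
aws s3 cp s3://myBucket/job-config.json /home/hadoop/job-config.json
```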
Obstacle 4: Getting data into Spark Job
This was straightforward. The job needs to read from a dump file which contains lines of JSON. I first uploaded the dump file, myFile.dump, to an S3 bucket, myBucket. Then, I used urllib.request to read the file from S3 and convert it to a Spark object. Here's the code snippet.
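(Reconstructed here as a sketch: it assumes the dump file is readable over HTTPS at its S3 URL and that sc is the SparkContext available to the job.)

```python
import json
import urllib.request

# Read the dump file from S3 over HTTPS. This assumes the object is readable
# at this URL (public or pre-signed) and that `sc` is the job's SparkContext.
url = "https://s3.amazonaws.com/myBucket/myFile.dump"
raw = urllib.request.urlopen(url).read().decode("utf-8")

# Each line of the dump is a JSON document; parse the lines and distribute
# the resulting records across the cluster as an RDD.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
rdd = sc.parallelize(records)
```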
Obstacle 5: Submitting Python 3.4 Spark Jobs to EMR
This was by far the most time-consuming of all the challenges. Before we go forward, let me give a brief explanation of how Spark jobs are submitted to EMR.
EMR runs Spark jobs by adding a “Step” to the EMR cluster. There are multiple step types from which to choose; in my case, I chose the Spark application type. When a Spark application type is selected, the “Step” essentially becomes a thin wrapper around the spark-submit command. It provides users the option to specify spark-submit options, i.e., spark-submit command parameters. If no options are specified, EMR uses the default Spark configuration. Additionally, you must provide an application location; in my case, the application location was a Python file on S3. There is also an option to specify arguments for the application, as well as an action to take upon failure of the application. Upon adding a step, the EMR cluster will attempt to use spark-submit to run the job. One important thing to note about Spark on EMR is that its executors run in Apache Hadoop YARN containers. If you're more interested in how it works, I highly recommend this awesome post.
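If you prefer the CLI to the console, adding such a step looks roughly like the sketch below; the cluster ID, bucket, and file names are placeholders.

```bash
# Add a Spark step to an existing EMR cluster (IDs and paths are placeholders).
# Everything in Args is handed to spark-submit; the application location
# (a Python file on S3) comes last.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=MyPythonJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://myBucket/my_spark_job.py]'
```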
Python Probz
After going through the above process, I noticed that the steps kept failing. I checked the stderr logs via the EMR console and saw that the jobs were being submitted but then failing. Here's an example of the error:
Unfortunately, this isn't very helpful, as many things could cause a spark-submit to fail. So I SSH'd into the master EMR node to investigate. I decided to look at the YARN logs first, since the Spark executors run inside YARN containers.
I found this:
I'll spare you the nitty-gritty details about urllib, but trust that this happens when you try to import urllib.request in versions of Python < 3. This meant Spark/EMR was actually using Python 2.7, despite the configuration telling it to use Python 3.4. I verified that the PySpark shell was indeed using Python 3.4. Then I noticed that the job, which is started via the spark-submit process, was running on Python 2.7. It made sense that this could happen, since PySpark and spark-submit are two different processes. After some research, I found several sources that recommended adding the following to spark-env.sh:
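(The commonly suggested exports are along these lines; the exact variables vary by source, and the python34 name is an assumption about the interpreter name on the nodes.)

```bash
# Typical suggestion: force PySpark to use the Python 3.4 interpreter.
# (These did not fix the spark-submit problem in my case.)
export PYSPARK_PYTHON=python34
export PYSPARK_DRIVER_PYTHON=python34
```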
or manually exporting them from the command line (in my case on each node).
I tried doing both and rerunning the cluster, with no luck. I went down the rabbit hole of trying to modify the PATH, creating aliases, and so on. I even tried creating a virtualenv with Python 3.4 and activating it in my code, to no avail.
The Solution
I then examined the YARN documentation and observed these configuration options: spark.yarn.appMasterEnv and spark.executorEnv. Both seemed promising, so I decided to pass them into the Step config via the --conf parameter, like so:
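(A sketch of the options supplied in the step's spark-submit options field; python34 is an assumption about the name of the Python 3.4 binary on the nodes, so adjust it for your environment.)

```bash
# Supplied as spark-submit options on the EMR Spark step: point both the YARN
# application master and the executors at Python 3.4.
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python34 \
--conf spark.executorEnv.PYSPARK_PYTHON=python34
```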
Added the step. Cluster ran. Job completed. Discreet yet aggressive fist pump at my desk.
Random Helpful Tidbits
Important Locations on EMR nodes:
/usr/lib/spark/config - Spark config
/usr/lib/hadoop/config - Hadoop config
/var/log/ - logs

hadoop user - runs spark-submit jobs
Commands:
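For example, pulling the aggregated YARN logs for an application from the master node (the application ID is a placeholder):

```bash
# Aggregate and print the YARN container logs for a given application.
yarn logs -applicationId <application_id>
```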
Conclusion
Despite some small headaches, I think EMR is a great service that's going to make life a lot easier for many. It's a huge time saver because the EMR service handles the user-selected configuration and Amazon manages the underlying cluster. The user interfaces make it easy to stand up and configure a cluster. I only encountered some minor configuration troubles, mostly because of my specific use case. For my fellow Spark and Python enthusiasts, I hope this helps. For others, I hope you learned something. For both, thanks for reading, and stay tuned for a follow-up post focusing on the pattern analytics!