apache-spark, hive, elastic-map-reduce, apache-spark-1.6

How to register S3 Parquet files in a Hive Metastore using Spark on EMR


I am using Amazon Elastic MapReduce (EMR) 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.

Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names that are registered in the Hive Metastore (e.g., looking up the foo table rather than the parquet.`s3://bucket/key/prefix/foo/parquet` style of doing things). I also want these table registrations to persist for the lifetime of the Hive Metastore (which lives on a separate RDS instance), even if I tear down the EMR cluster and spin up a new one connected to the same Metastore.
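For concreteness, this is roughly what the two query styles look like from a Spark shell (the table name and bucket path are placeholders); the goal is to be able to use the first form:

    // `sqlContext` is the HiveContext provided by spark-shell on EMR.

    // Desired: query by the name registered in the Hive Metastore.
    val byName = sqlContext.sql("SELECT * FROM foo")

    // Workaround to avoid: query by raw S3 path (Spark 1.6 path-based Parquet syntax).
    val byPath = sqlContext.sql("SELECT * FROM parquet.`s3://bucket/key/prefix/foo/parquet`")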

Problem: if I do something like df.write.saveAsTable("foo"), Spark will, by default, create a managed table in the Hive Metastore (see https://spark.apache.org/docs/latest/sql-programming-guide.html). A managed table copies the data from S3 into HDFS on the EMR cluster, which means the registered metadata would be useless after tearing down the EMR cluster.
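To make the problem concrete, here is a minimal sketch (bucket path and table name are placeholders) of the call that ends up producing a managed table:

    // In spark-shell on EMR, `sqlContext` is a HiveContext wired to the shared Metastore.
    val df = sqlContext.read.parquet("s3://bucket/key/prefix/foo/parquet")

    // By default this creates a MANAGED table: the data is copied into the Hive
    // warehouse directory (HDFS on the cluster), so it is lost when the cluster
    // is torn down, even though the Metastore itself survives.
    df.write.saveAsTable("foo")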


Solution

  • The solution was to register the S3 Parquet data as an external table.

    sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

    I haven't figured out how to save a file to S3 and register it as an external table all in one shot, but createExternalTable doesn't add much overhead; a sketch of the two-step write-then-register approach is below.
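A minimal sketch of that two-step approach, assuming a placeholder bucket/prefix and a DataFrame `df` produced by earlier processing:

    // Step 1: write the processed data to S3 as Parquet.
    df.write.mode("overwrite").parquet("s3://bucket/key/prefix/foo/parquet")

    // Step 2: register the S3 location as an external table in the Metastore.
    // Only metadata is stored; dropping the table later leaves the S3 files intact,
    // and a new EMR cluster pointed at the same Metastore can query `foo` by name.
    sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")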