apache-spark, hive, elastic-map-reduce, apache-spark-1.6

How to register S3 Parquet files in a Hive Metastore using Spark on EMR


I am using Amazon Elastic MapReduce (EMR) 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.

Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names that are registered in the Hive Metastore (e.g., looking up the foo table rather than the parquet.`s3://bucket/key/prefix/foo/parquet` style of doing things). I also want these table registrations to persist for the lifetime of the Hive Metastore (which lives on a separate RDS instance), even if I tear down the EMR cluster and spin up a new one connected to the same Metastore.
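For concreteness, this is roughly what the two query styles look like from a Spark shell (the table name and bucket path are placeholders); the goal is to be able to use the first form:

    // `sqlContext` is the HiveContext provided by spark-shell on EMR.

    // Desired: query by the name registered in the Hive Metastore.
    val byName = sqlContext.sql("SELECT * FROM foo")

    // Workaround to avoid: query by raw S3 path (Spark 1.6 path-based Parquet syntax).
    val byPath = sqlContext.sql("SELECT * FROM parquet.`s3://bucket/key/prefix/foo/parquet`")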

Problem: if I do something like df.write.saveAsTable("foo"), Spark will, by default, create a managed table in the Hive Metastore (see https://spark.apache.org/docs/latest/sql-programming-guide.html). A managed table copies the data from S3 into HDFS on the EMR cluster, which means the registered metadata would be useless after tearing down the EMR cluster.
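To make the problem concrete, here is a minimal sketch (bucket path and table name are placeholders) of the call that ends up producing a managed table:

    // In spark-shell on EMR, `sqlContext` is a HiveContext wired to the shared Metastore.
    val df = sqlContext.read.parquet("s3://bucket/key/prefix/foo/parquet")

    // By default this creates a MANAGED table: the data is copied into the Hive
    // warehouse directory (HDFS on the cluster), so it is lost when the cluster
    // is torn down, even though the Metastore itself survives.
    df.write.saveAsTable("foo")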


Solution

  • The solution was to register the S3 Parquet data as an external table.

    sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

    I haven't figured out how to save a file to S3 and register it as an external table all in one shot, but createExternalTable doesn't add much overhead; a sketch of the two-step write-then-register approach is below.
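A minimal sketch of that two-step approach, assuming a placeholder bucket/prefix and a DataFrame `df` produced by earlier processing:

    // Step 1: write the processed data to S3 as Parquet.
    df.write.mode("overwrite").parquet("s3://bucket/key/prefix/foo/parquet")

    // Step 2: register the S3 location as an external table in the Metastore.
    // Only metadata is stored; dropping the table later leaves the S3 files intact,
    // and a new EMR cluster pointed at the same Metastore can query `foo` by name.
    sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")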