I have an RDD in memory. I would like to group the RDD using some arbitrary function and then write each group out as a separate Parquet file.
For instance, if my RDD consists of JSON strings of the form:
{"type":"finish","resolution":"success","csr_id": 214}
{"type":"create","resolution":"failure","csr_id": 321}
{"type":"action","resolution":"success","csr_id": 262}
I would want to group the JSON strings by the "type" property, and write each group of strings with the same "type" to the same Parquet file.
I can see that the DataFrame API allows writing out Parquet files as follows (for instance, if the RDD consists of JSON strings):
final JavaRDD<String> rdd = ...
final SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
final DataFrame dataFrame = sqlContext.read().json(rdd);
dataFrame.write().parquet(location);
This would write the entire DataFrame to one location, though, so the resulting Parquet files would contain records with different values for the "type" property.
The DataFrame API also supplies a groupBy function, although it takes column names or Columns rather than an arbitrary function:
final GroupedData groupedData = dataFrame.groupBy("type");
The GroupedData API does not appear to provide any function for writing each group out to an individual file, though.
Any ideas?
You cannot write GroupedData, but you can partition the data on write:
dataFrame.write.partitionBy("type").format("parquet").save("/tmp/foo")
Each type will be written to its own directory in ${column}=${value} format (for example /tmp/foo/type=action, /tmp/foo/type=create, and /tmp/foo/type=finish). These directories can be loaded separately:
sqlContext.read.parquet("/tmp/foo/type=action").show
// +------+----------+
// |csr_id|resolution|
// +------+----------+
// |   262|   success|
// +------+----------+
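Since the question uses the Java API, here is roughly what the same approach looks like there. This is a sketch, assuming Spark 1.4+ (where DataFrame.write() returns a DataFrameWriter) and that rdd is the JavaRDD<String> from the question:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

final SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
final DataFrame dataFrame = sqlContext.read().json(rdd);

// Writes one sub-directory per distinct "type" value (Hive-style partitioning).
dataFrame.write().partitionBy("type").format("parquet").save("/tmp/foo");

// Load a single group, as above:
sqlContext.read().parquet("/tmp/foo/type=action").show();

// Loading the root directory instead restores "type" as a regular column
// through partition discovery:
sqlContext.read().parquet("/tmp/foo").show();

And if the grouping key really is computed by an arbitrary function rather than stored in an existing column, you can materialize the function's result as a column first and then partition the write by that column. A sketch, assuming Spark 1.5+ for callUDF; myFunction and groupKey are hypothetical names:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Register the (hypothetical) grouping function as a UDF; here it just
// returns its input unchanged.
sqlContext.udf().register("myFunction",
    (UDF1<String, String>) type -> type, DataTypes.StringType);

// Materialize the grouping key as a column, then partition by it on write.
dataFrame
    .withColumn("groupKey", callUDF("myFunction", col("type")))
    .write()
    .partitionBy("groupKey")
    .format("parquet")
    .save("/tmp/bar");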