If I partition a data set, will it be in the correct order when I read it back? For example, consider the following pyspark code:
# read a csv
df = sql_context.read.csv(input_filename)
# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))
# write out to parquet
df.write.parquet(output_path, partitionBy=['hash'])
# read back the file
df2 = sql_context.read.parquet(output_path)
I am partitioning on a customer_id bucket. When I read back the whole data set, are the partitions guaranteed to be merged back together in the original insertion order?
Right now, I'm not so sure, so I'm adding a sequence column:
df = df.withColumn('seq', monotonically_increasing_id())
However, I don't know if this is redundant.
No, it's not guaranteed. Try it with even a tiny data set:
df = spark.createDataFrame([(1,'a'),(2,'b'),(3,'c'),(4,'d')],['customer_id', 'name'])
# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))
# write out to parquet
df.write.parquet("test", partitionBy=['hash'], mode="overwrite")
# read back the file
df2 = spark.read.parquet("test")
df.show()
+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
| 1| a| 1|
| 2| b| 2|
| 3| c| 3|
| 4| d| 0|
+-----------+----+----+
df2.show()
+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
| 2| b| 2|
| 1| a| 1|
| 4| d| 0|
| 3| c| 3|
+-----------+----+----+