I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
Running Spark 2.0.1 running local SparkSession. I have a csv dataset that I am saving as parquet file on my disk like so:
val df0 = spark
.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("inferSchema", false)
.load("SomeFile.csv"))
val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)
df.write
.mode(SaveMode.Overwrite)
.format("parquet")
.option("inferSchema", false)
.save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte
. This should group multiple numerocarte
to same partition. I don't want to do partitionBy("numerocarte") at the write
time because I don't want one partition per card. It would be millions of them.
After that in another script I read this SomeFile.parquet
parquet file and do some operations on it. In particular I am running a window function
on it where the partitioning is done on the same column that the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = spark.read
.format("parquet")
.option("header", true)
.option("inferSchema", false)
.load("SomeFile.parquet")
val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))
df2.withColumn("NewColumnName",
sum(col("dollars").over(w))
After read
I can see that the repartition
worked as expected and DataFrame df2
has 42 partitions and in each of them are different cards.
Questions:
df2
is partitioned by column numerocarte
?DataFrame
? Is there a command for this? I know how to check number of partitions but how to see partitioning key? read
and 200 partitions after withColumn
which suggests that Spark repartitioned my DataFrame
.Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you save data which has been shuffled, it does not mean, that it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you loaded data, but you can check queryExecution
for Partitioner
.
In practice:
partitionBy
method of DataFrameWriter
.bucketBy
with metastore and persistent tables.See How to define partitioning of DataFrame? for detailed examples.