Search code examples
apache-sparkpartitioningwindow-functions

Does Spark know the partitioning key of a DataFrame?


I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.

Context:

Running Spark 2.0.1 running local SparkSession. I have a csv dataset that I am saving as parquet file on my disk like so:

val df0 = spark
  .read
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", false)
  .load("SomeFile.csv"))


val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)

df.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .option("inferSchema", false)
  .save("SomeFile.parquet")

I am creating 42 partitions by column numerocarte. This should group multiple numerocarte to same partition. I don't want to do partitionBy("numerocarte") at the write time because I don't want one partition per card. It would be millions of them.

After that in another script I read this SomeFile.parquet parquet file and do some operations on it. In particular I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = spark.read
  .format("parquet")
  .option("header", true)
  .option("inferSchema", false)
  .load("SomeFile.parquet")

val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))

df2.withColumn("NewColumnName",
      sum(col("dollars").over(w))

After read I can see that the repartition worked as expected and DataFrame df2 has 42 partitions and in each of them are different cards.

Questions:

  1. Does Spark know that the dataframe df2 is partitioned by column numerocarte?
  2. If it knows, then there will be no shuffle in the window function. True?
  3. If it does not know, It will do a shuffle in the window function. True?
  4. If it does not know, how do I tell Spark the data is already partitioned by the right column?
  5. How can I check a partitioning key of DataFrame? Is there a command for this? I know how to check number of partitions but how to see partitioning key?
  6. When I print number of partitions in a file after each step, I have 42 partitions after read and 200 partitions after withColumn which suggests that Spark repartitioned my DataFrame.
  7. If I have two different tables repartitioned with the same column, would the join use that information?

Solution

  • Does Spark know that the dataframe df2 is partitioned by column numerocarte?

    It does not.

    If it does not know, how do I tell Spark the data is already partitioned by the right column?

    You don't. Just because you save data which has been shuffled, it does not mean, that it will be loaded with the same splits.

    How can I check a partitioning key of DataFrame?

    There is no partitioning key once you loaded data, but you can check queryExecution for Partitioner.


    In practice:

    • If you want to support efficient pushdowns on the key, use partitionBy method of DataFrameWriter.
    • If you want a limited support for join optimizations use bucketBy with metastore and persistent tables.

    See How to define partitioning of DataFrame? for detailed examples.