apache-spark, apache-spark-sql, parquet

Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?


I have to load a CSV file from HDFS into a Spark DataFrame. I was wondering if there is a performance difference (query speed) between a DataFrame backed by a CSV file and one backed by a Parquet file.

Typically, I load a CSV file into a DataFrame like the following.

val df1 = sqlContext.read
     .format("com.databricks.spark.csv")
     .option("header", "true")
     .option("inferSchema", "true")
     .load("hdfs://box/path/to/file.csv")

On the other hand, loading a Parquet file (assuming I've parsed the CSV file, created a schema, and saved it to HDFS) looks like the following.

val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
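
(For reference, the Parquet file itself was produced from the CSV-backed DataFrame roughly like this; the overwrite mode is just illustrative.)

df1.write
   .mode("overwrite")
   .parquet("hdfs://box/path/to/file.parquet")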

Now I'm wondering whether the query times for operations like the following would be impacted and/or different.

  • df1.where("col1='some1'").count()
  • df1.where("col1='some1' and col2='some2'").count()

I'm also wondering if anyone knows whether there is predicate pushdown for Parquet?

To me, Parquet seems somewhat like an inverted index, and I would expect simple filters for counts to be faster on a Parquet-backed DataFrame than on a CSV-backed one. With the CSV-backed DataFrame, I imagine a full scan of the data set would have to happen every time we filter for items.
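
In case it helps, I assume I could check whether a filter actually gets pushed down by looking at the physical plan, something like this (just a sketch; I haven't verified the exact plan output):

df2.where("col1 = 'some1'").explain()
// For the Parquet-backed DataFrame I would expect the physical plan to
// show a PushedFilters entry (e.g. EqualTo(col1,some1)), while the
// CSV-backed plan should just show a full scan followed by a filter.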

Any clarification on query performance for CSV-backed vs Parquet-backed DataFrames is appreciated. Suggestions for any other file format that would help speed up query counts on DataFrames are also welcome.


Solution

  • CSV is a row-oriented format, while Parquet is a column-oriented format.

    Typically, row-oriented formats are more efficient for queries that either must access most of the columns or read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only have to access a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries tend to fall into the former.

    Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.

    Since the Hadoop ecosystem is typically used for analytical workloads, Parquet is generally a better choice for performance than CSV for Hadoop applications.
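
    As a concrete sketch (reusing the column names from the question; the path, schema, and write step are assumptions), this is the kind of query where the difference shows up: only the referenced columns are read from the Parquet file, and the equality predicates can be pushed down to the Parquet reader, whereas the CSV-backed DataFrame has to parse every line and every field.

    // Sketch only: assumes col1 and col2 exist and that the Parquet file
    // was written from the same data as the CSV.
    val parquetDf = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")

    // Column pruning: only col1 and col2 are read from disk.
    // Predicate pushdown: row groups whose statistics rule out
    // col1 = 'some1' can be skipped entirely.
    val parquetCount = parquetDf
      .where("col1 = 'some1' and col2 = 'some2'")
      .count()

    // The CSV-backed DataFrame (df1 in the question) must read and parse
    // every line of text to answer the same query.
    val csvCount = df1
      .where("col1 = 'some1' and col2 = 'some2'")
      .count()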