Search code examples
apache-sparkpysparkdata-sciencedistributed-computingdata-processing

Does dropping columns that are not used in computation affect performance in spark?


I have a large dataset (hundreds of millions of rows) that I need to heavily process using spark with Databricks. This dataset has tens of columns, typically an integer, float, or array of integers.

My question is: does it make any difference if I drop some columns that are not needed before processing the data? In terms of memory and/or processing speed?


Solution

  • It depends what are you going to do with this dataset. Spark is smart enough to figure out which column are really needed, but its not always that easy. For example when you use UDF (user defined fucntion) which is operating on case class with all column defined, all column are going to be select from source as from Spark perspective such UDF is a black box.

    You can check which column are selected for your job via SparkUI. For example check out this blog post: https://medium.com/swlh/spark-ui-to-debug-queries-3ba43279efee

    In your plan you can look for this line: PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string>

    In ReadSchema you will be able to figure out which column are read by Spark and if they are really needed in our processing