Tags: apache-spark, pyspark, apache-spark-sql, parquet, azure-synapse

Parquet partition performance with where clause


I'm trying to optimize query performance for a PySpark SQL query of parquet files in Azure Synapse Analytics. My data set is billions of records, so any bit of performance I can get is great. My basic question is: does the columnar storage of parquet really help me with my where clause on year, or must I use /Year=2023 in the OPENROWSET path to get that real performance boost?

Details

From the Spark job that creates the partitioned parquet files I calculate two fields relevant to this post - Year and event_year. The calculation for both is the same - year(event_date). I create the value twice because when I save the parquet files I'm using partitionBy on Year (as well as Month and Day), and partitionBy removes the partitioning field from the actual output files, which is why event_year is also included in the output.
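
For illustration, a minimal sketch of that write in PySpark (df is assumed to be the events DataFrame; the column and path names come from this post):

    from pyspark.sql import functions as F

    # Derive the duplicated year columns from event_date
    df = (df
          .withColumn("event_year", F.year("event_date"))  # stays in the data files
          .withColumn("Year", F.year("event_date"))        # consumed by partitionBy
          .withColumn("Month", F.month("event_date"))
          .withColumn("Day", F.dayofmonth("event_date")))

    # partitionBy moves Year/Month/Day into the directory names and drops them
    # from the parquet files themselves; event_year remains in every file.
    (df.write
       .mode("overwrite")
       .partitionBy("Year", "Month", "Day")
       .parquet("/myFiles/events"))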

The output goes to /myFiles/events/Year=XXXX/Month=XX/Day=XX

When I query the data with OPENROWSET I can obviously specify directories if I want a particular year, but my goal is to create a single view in the Azure Synapse Analytics serverless pool called dbo.events that covers all years (all of /myFiles/events), so users could just query where Year = XXXX and be as efficient as possible. I have years ranging from 2020 through the current year (2023 and 2024 being the previous and current years at the time of this post).

I come from a traditional SQL world where an index on Year would guarantee me what I'm looking for. However, in this serverless pool a query from dbo.events where event_year = 2023 doesn't seem to perform any better than the OPENROWSET(/myFiles/events/Year=2023) method.
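
To make the comparison concrete on the Spark side (Synapse serverless uses its own engine, so treat this only as an analogue), these are the query shapes being compared; Year is a partition column derived from the directory names, while event_year is an ordinary data column:

    # Read the whole dataset; Spark infers Year/Month/Day from the directory layout
    events = spark.read.parquet("/myFiles/events")

    # Filter on the partition column: only the Year=2023 directories should be read
    events.where("Year = 2023").count()

    # Filter on the data column: no partition pruning, every file is at least opened
    events.where("event_year = 2023").count()

    # Point directly at the year directory: pruning done by hand via the path
    spark.read.parquet("/myFiles/events/Year=2023").count()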

So my question is - does the columnar storage of parquet really help me with my where event_year = 2023, or must I use the Year=2023 in the OPENROWSET file path to get that real performance boost?

Thanks!


Solution

  • If it's normal parquet you'd probably need to use the directory directly. The ADLS Hadoop implementation (and the HNS API it calls) is terrible at scanning large datasets with that level of nesting: it will scan for files before any predicate pushdown or partition pruning takes place.

    If possible, prefer / convert to Delta (although with the size of data you hint at, that's probably tough to do). Moving to Delta took a sample set from a 40-minute scan down to <1m, as the transaction log already knew the files and paths to look at.
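
    A minimal sketch of that conversion with the Delta Lake API, assuming Delta is available in your Spark pool and using the path and partition columns from the question:

        from delta.tables import DeltaTable

        # Converts the existing parquet layout in place; the partition schema
        # must be declared so the Year=/Month=/Day= directories end up in the
        # transaction log instead of being discovered by a directory scan.
        DeltaTable.convertToDelta(
            spark,
            "parquet.`/myFiles/events`",
            "Year INT, Month INT, Day INT")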

    That said, even after the file scan (which happens before your actual query even starts), your predicate on event_year doesn't relate to the Year in the partition path, so you'd also have to query on Year (or re-write the plan to do this automatically).

    Statistics-wise, if event_year is within the first 10 columns (the default on Parquet, I believe; Databricks ups it to 32), event_year will be the same in each part file, so although it would have to touch every file (if you don't also query Year), it would only have to read the column statistics at the bottom of the file, not the entire file.

    If it's not in the statistics (and Year isn't also queried) then all the files will be scanned fully, which is obviously not ideal.

    In summary, performance will be bad for file scanning unless you use the Year path directly (which stops cross-year queries etc.), and worse still if you do not also use Year in the query (which would enable partition pruning after the scan) and/or event_year is not in the statistics.

    The above is definitely the case for external tables in Databricks (non-warehouse) and the OSS ADLS driver / Hadoop integration. I can't find any documentation on partition-relevant optimisations applied by Synapse, but it's entirely possible they've abstracted that away for parquet tables added to the catalog. Hopefully someone can reply with any Synapse-specific optimisations.

    edit: re the first 10/32 - yes, I meant column order; only the first x columns have statistics collected against them. So in this case, if event_year is physically the last column, then all files would be scanned fully.
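
    If event_year currently sits late in the schema, a hedged sketch of moving it to the front at the write stage (names as in the question; whether this helps depends on how many columns your engine actually collects statistics for):

        # Reorder so event_year is the first data column and therefore falls
        # inside the first-N-columns statistics window described above.
        other_cols = [c for c in df.columns if c != "event_year"]
        (df.select("event_year", *other_cols)
           .write
           .mode("overwrite")
           .partitionBy("Year", "Month", "Day")
           .parquet("/myFiles/events"))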

    If you can look at the query plan, you should see whether there are skipped reads, or whether the number of files read changes, when you add Year= to the query.
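
    On the Spark side a quick way to check is explain(); look for the PartitionFilters entry in the file scan node (and the files-read metrics in the Spark UI) with and without the Year predicate:

        # Only the data column: the scan node's PartitionFilters should be empty
        (spark.read.parquet("/myFiles/events")
              .where("event_year = 2023")
              .explain())

        # Partition column added: PartitionFilters should now show Year = 2023
        (spark.read.parquet("/myFiles/events")
              .where("event_year = 2023 AND Year = 2023")
              .explain())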

    re changing the plan: if you are running the queries yourself you could always, whenever you scan for event_year = x, add Year = x to the same query. Doing it automatically is heavier lifting and would depend on whether Synapse allows extensions - Quality uses extensions to auto-insert new query plan transformations, and the same may be possible on Synapse (but I've no experience in that, I'm afraid).
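
    If you are adding that redundant predicate by hand, a hypothetical helper along these lines (events_for_year is a made-up name for illustration) keeps the two filters in sync:

        def events_for_year(spark, year):
            # Year drives partition pruning; event_year drives the column
            # statistics / row-group skipping inside each parquet file.
            return (spark.read.parquet("/myFiles/events")
                         .where(f"event_year = {year} AND Year = {year}"))

        df_2023 = events_for_year(spark, 2023)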