
Get sizes of individual columns of delta/parquet table


I would like to check how much each column of the parquet data contributes to the total file size / total table size.

I looked through Spark/Databricks commands, parquet-cli, and parquet-tools, and unfortunately it seems that none of them provides this information directly. Since Parquet is a columnar format, it should be possible to extract it somehow.

So far the closest I have come is to run parquet-tools meta, sum up the details by column for each row group within a file, and then aggregate this across all files of the table. This means iterating over all parquet files and cumbersome parsing of the text output.
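
For illustration, the per-row-group details that parquet-tools meta prints can also be read programmatically, e.g. with pyarrow (a rough sketch; the file name is only a placeholder):

    import pyarrow.parquet as pq

    # Read just the footer of a single parquet file (placeholder name).
    meta = pq.read_metadata("part-00000.parquet")

    # One column chunk per (row group, column) pair -- the same details
    # that parquet-tools meta prints per column.
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            print(rg, chunk.path_in_schema,
                  chunk.total_compressed_size, chunk.total_uncompressed_size)

This avoids the text parsing, but still leaves the summing per column and across all files to be done by hand.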

Maybe there is an easier way?


Solution

  • Your approach is correct. Here is a Python script that uses DuckDB to get the overall compressed and uncompressed size of each column in a parquet dataset.

    import duckdb

    # parquet_metadata() returns one row per column chunk (per row group, per file),
    # so summing per path_in_schema gives each column's total footprint.
    con = duckdb.connect(database=':memory:')
    print(con.execute("""
        SELECT SUM(total_compressed_size)   AS total_compressed_size_in_bytes,
               SUM(total_uncompressed_size) AS total_uncompressed_size_in_bytes,
               path_in_schema               AS column_name
        FROM parquet_metadata('D:\\dev\\tmp\\parq_dataset\\*')
        GROUP BY path_in_schema
    """).df())


    In D:\\dev\\tmp\\parq_dataset\\* the directory parq_dataset consists of multiple parquet files with the same schema, and the glob matches all of them. Something similar should be possible with other libraries such as pyarrow or fastparquet; a sketch follows below.
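
    For instance, a rough pyarrow-based sketch of the same aggregation (reusing the example path above and assuming the directory contains only parquet files) could look like this:

    import glob
    from collections import defaultdict

    import pyarrow.parquet as pq

    compressed = defaultdict(int)
    uncompressed = defaultdict(int)

    # Walk every file of the dataset and sum column chunk sizes per column,
    # across all row groups of all files.
    for path in glob.glob('D:\\dev\\tmp\\parq_dataset\\*'):
        meta = pq.read_metadata(path)
        for rg in range(meta.num_row_groups):
            row_group = meta.row_group(rg)
            for col in range(row_group.num_columns):
                chunk = row_group.column(col)
                compressed[chunk.path_in_schema] += chunk.total_compressed_size
                uncompressed[chunk.path_in_schema] += chunk.total_uncompressed_size

    # Print columns from largest to smallest on-disk footprint.
    for column in sorted(compressed, key=compressed.get, reverse=True):
        print(column, compressed[column], uncompressed[column])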