
Using Parquet metadata to find a specific key


I have a bunch of Parquet files containing data where each row has the form [key, data1, data2, data3,...]. I need to know in which file a certain key is located, without actually opening each file and searching. Is it possible to get this from the Parquet metadata?

The keys are formatted as strings.

I already tried accessing the metadata using PyArrow, but didn't get the data I wanted.


Solution

  • Short answer is no.

    Longer answer: Parquet has two types of metadata that help in eliminating data: min/max statistics and, optionally, Bloom filters. With these two you can definitively determine that a file does not contain your key, but you cannot determine with certainty that it does (unless your key happens to be a min or max value). PyArrow currently only really exposes row group statistics and doesn't support Bloom filter reading/writing at all.

    Also, if the key is of low enough cardinality, then dictionary encoding might be used to encode the column. If all data in a column is dictionary encoded, then it might be possible through some lower-level APIs (likely not PyArrow) to retrieve the dictionaries and scan them instead of the entire file.

    If you are in control of the writing process, then sorting the data by key or limiting the number of keys per file would make these methods more efficient, because each file's min/max range becomes narrower and excludes more lookups.