Search code examples
pythonpandasparquet

Multiple conditions for filters on partitioned columns with pandas read_parquet


If I have a partitioned data and I was to filter using the filters argument in pd.read_parquet how can I accomplish that? For example:

import pandas as pd


data = {
    "ID": [1, 2, 3],
    "Value": ["A", "B", "C"]
}
df = pd.DataFrame(data)


parquet_folder = "example_partitioned"
df.to_parquet(parquet_folder, index=False, partition_cols=["Value"])

So I have partitioned data structure on disk. If I construct a filter condition like this it works:

filter_conditions = [
    ("Value", "==", "A")
]


pd.read_parquet(parquet_folder, filters=filter_conditions)

But if I want multiple conditions (i.e. A OR B) the following does not work:

filter_conditions_two = [
    ("Value", "==", "A"),
    ("Value", "==", "B")
]

pd.read_parquet(parquet_folder, filters=filter_conditions_two)

That instead returns a empty data frame. Is this possible with filters?


Solution

  • You can do it this way:

    import pandas as pd
    
    data = {
        "ID": [1, 2, 3],
        "Value": ["A", "B", "C"]
    }
    
    df = pd.DataFrame(data)
    
    parquet_folder = "example_partitioned"
    
    df.to_parquet(parquet_folder, index=False, partition_cols=["Value"])
    
    filter_conditions = [("Value", "in", ["A", "B"])]
    filtered_df = pd.read_parquet(parquet_folder, filters=filter_conditions)
    print(filtered_df)
    

    which gives

       ID Value
    0   1     A
    1   2     B
    

    In the same way, if your dataframe had another column Category on which you also applied a partition, you would do

    filter_conditions = [("Value", "in", ["A", "B"]), ("Category", "A1", "A2
    ")]