If I have a partitioned data and I was to filter using the filters
argument in pd.read_parquet
how can I accomplish that? For example:
import pandas as pd
data = {
"ID": [1, 2, 3],
"Value": ["A", "B", "C"]
}
df = pd.DataFrame(data)
parquet_folder = "example_partitioned"
df.to_parquet(parquet_folder, index=False, partition_cols=["Value"])
So I have partitioned data structure on disk. If I construct a filter condition like this it works:
filter_conditions = [
("Value", "==", "A")
]
pd.read_parquet(parquet_folder, filters=filter_conditions)
But if I want multiple conditions (i.e. A OR B) the following does not work:
filter_conditions_two = [
("Value", "==", "A"),
("Value", "==", "B")
]
pd.read_parquet(parquet_folder, filters=filter_conditions_two)
That instead returns a empty data frame. Is this possible with filters?
You can do it this way:
import pandas as pd
data = {
"ID": [1, 2, 3],
"Value": ["A", "B", "C"]
}
df = pd.DataFrame(data)
parquet_folder = "example_partitioned"
df.to_parquet(parquet_folder, index=False, partition_cols=["Value"])
filter_conditions = [("Value", "in", ["A", "B"])]
filtered_df = pd.read_parquet(parquet_folder, filters=filter_conditions)
print(filtered_df)
which gives
ID Value
0 1 A
1 2 B
In the same way, if your dataframe had another column Category
on which you also applied a partition, you would do
filter_conditions = [("Value", "in", ["A", "B"]), ("Category", "A1", "A2
")]