I am trying to filter pyarrow data with pyarrow.dataset. I want a dynamic way to combine an arbitrary number of filter expressions into one.
from pyarrow import parquet as pq
import pyarrow.dataset as ds
import datetime
exp1 = ds.field("IntCol") == 1
exp2 = ds.field("StrCol") == 'A'
exp3 = ds.field("DateCol") == datetime.date.today()
filters = (exp1 & exp2 & exp3)
print(filters)
# To be used when reading Parquet tables
df = pq.read_table('sample.parquet', filters=filters)
How can I do this without writing "&" explicitly, since I may have N expressions? I have been looking at ways to combine a collection of expressions, such as np.logical_and.accumulate(). It gets me partway there, but I still need to convert the resulting array into a single expression.
np.logical_and.accumulate([exp1, exp2, exp3])
out: array([<pyarrow.dataset.Expression (IntCol == 1)>,
<pyarrow.dataset.Expression (StrCol == "A")>,
<pyarrow.dataset.Expression (DateCol == 2021-06-09)>], dtype=object)
Going down the NumPy route may not be the best approach. Does anyone have a suggestion for how this can be done?
You can use operator.and_ as the functional equivalent of the & operator. Then functools.reduce can apply it cumulatively across a list of expressions.
Using your three example expressions:
>>> import operator
>>> import functools
>>> functools.reduce(operator.and_, [exp1, exp2, exp3])
<pyarrow.dataset.Expression (((IntCol == 1) and (StrCol == "A")) and (DateCol == 2021-06-10))>
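One caveat if the list is built dynamically: functools.reduce raises a TypeError when given an empty sequence and no initial value. Below is a minimal sketch of a wrapper for that case. The name combine_filters is hypothetical (it is not part of pyarrow), and it assumes that returning None for "no filter" suits your code, since filters=None is the default for pq.read_table.

```python
import functools
import operator


def combine_filters(expressions):
    """AND together any number of expressions that support the & operator.

    Hypothetical helper, not part of pyarrow. Returns None when there is
    nothing to combine, so the result can still be passed as `filters`.
    """
    expressions = list(expressions)
    if not expressions:
        # reduce() would raise TypeError on an empty sequence with no initial value
        return None
    # Equivalent to expressions[0] & expressions[1] & ... & expressions[-1]
    return functools.reduce(operator.and_, expressions)


# Usage with the expressions from the question (requires pyarrow):
# filters = combine_filters([exp1, exp2, exp3])
# df = pq.read_table('sample.parquet', filters=filters)
```

Because the helper relies only on the & operator, it works with any list length, including a single expression, which is returned unchanged.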