Search code examples
pythonpandasparquetboolean-logic

Boolean logic in filters when loading parquet file


I want to remove persons that are born in 1900 and have not died yet.

Code below works, but I need two filters to remove specific rows. Is there a simpler way to remove the rows with one filter?

Minimal code to reproduce:

import pandas as pd

data = [
    (1900, None,),  # needs to be removed
    (1900, 2000,),
    (2000, None,),
    (2000, 2020,),
]
df = pd.DataFrame(data, columns=['birth', 'death'])
df.to_parquet('test.parquet')

# Rows which do not match the filter predicate will be removed
filters= [
    [
        ('birth', '!=', 1900),
    ],
    [
        ('birth', '=', 1900),
        ('death', 'not in', [None]),
    ]
]

df2 = pd.read_parquet('test.parquet', filters=filters)
df2.head()

Documentation: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table


Solution

  • You actually don't need the ('birth', '=', 1900) condition, you can keep rows that are (NOT BIRTH == 1900) OR (DEATH NOT IN NONE), equivalent to NOT (BIRTH == 1900 AND DEATH IN NONE):

    filters= filters= [[('birth', '!=', 1900)], [('death', 'not in', [None])]]
    
    df2 = pd.read_parquet('test.parquet', filters=filters)
    

    You could also use:

    import pyarrow.compute as pc
    filters = (pc.field('birth')!=1900) | ~pc.field('death').isin([None])
    
    # or 
    filters = ~( (pc.field('birth')==1900) & pc.field('death').isin([None]) )
    

    Output:

       birth   death
    0   1900  2000.0
    1   2000     NaN
    2   2000  2020.0