I want to remove persons that are born in 1900 and have not died yet.
Code below works, but I need two filters to remove specific rows. Is there a simpler way to remove the rows with one filter?
Minimal code to reproduce:
import pandas as pd
data = [
(1900, None,), # needs to be removed
(1900, 2000,),
(2000, None,),
(2000, 2020,),
]
df = pd.DataFrame(data, columns=['birth', 'death'])
df.to_parquet('test.parquet')
# Rows which do not match the filter predicate will be removed
filters= [
[
('birth', '!=', 1900),
],
[
('birth', '=', 1900),
('death', 'not in', [None]),
]
]
df2 = pd.read_parquet('test.parquet', filters=filters)
df2.head()
Documentation: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table
You actually don't need the ('birth', '=', 1900)
condition, you can keep rows that are (NOT BIRTH == 1900) OR (DEATH NOT IN NONE)
, equivalent to NOT (BIRTH == 1900 AND DEATH IN NONE)
:
filters= filters= [[('birth', '!=', 1900)], [('death', 'not in', [None])]]
df2 = pd.read_parquet('test.parquet', filters=filters)
You could also use:
import pyarrow.compute as pc
filters = (pc.field('birth')!=1900) | ~pc.field('death').isin([None])
# or
filters = ~( (pc.field('birth')==1900) & pc.field('death').isin([None]) )
Output:
birth death
0 1900 2000.0
1 2000 NaN
2 2000 2020.0