I am just starting to work with Parquet files, since some of my data is available in that format. I haven't used them before, so here is my question.
I open my parquet file like this:
import pyarrow.parquet as pq
table1 = pq.read_table('mydatafile.parquet')
The file has 10 columns. Is it possible to filter out, directly from this table, all rows where e.g. column3 has the value 1?
I mean, I could just do:
df = table1.to_pandas()
df = df[df["column3"] != 1]
But can this be done natively, without converting to a pandas DataFrame first?
You can pass a filters argument to read_table, using this syntax from the documentation:
import pyarrow.parquet as pq
table1 = pq.read_table('mydatafile.parquet', filters=[('column3', '!=', 1)])
Reference: "Using predicates to filter rows", from the pyarrow.parquet.ParquetDataset documentation.