I am working with a huge dataset stored in Parquet files. I need the ability to skip the first n rows without materializing them in memory. Based on the documentation, I should be able to build a compute expression using the special fields __fragment_index and __batch_index. I am using PyArrow 18.0.
My rough approach (simplified to illustrate the problem):
from pyarrow import dataset as ds
from pyarrow import compute as pc

dataset = ds.dataset(input_dir, format="parquet")

scan_columns = {}
scan_columns['my_column'] = pc.field('my_column')  # arbitrary column with my data
# Per-batch "global" index: fragment index * 10000 + batch index within the fragment
scan_columns['__proj_index'] = pc.add(
    pc.multiply(pc.field('__fragment_index'), pc.scalar(10000)),
    pc.field('__batch_index'),
)

filter = pc.field('__proj_index') >= pc.scalar(42000042)
batches = dataset.scanner(columns=scan_columns, filter=filter).to_batches()
Unfortunately, this generates an error:
File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
File "pyarrow/_dataset.pyx", line 3557, in pyarrow._dataset.Scanner.from_dataset
File "pyarrow/_dataset.pyx", line 3475, in pyarrow._dataset.Scanner._make_scan_options
File "pyarrow/_dataset.pyx", line 3422, in pyarrow._dataset._populate_builder
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(__fragment_index) in column_name: int64
Documentation at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_dataset says:
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
Any insights into what might be wrong here?
There is a known issue that may be related: "Fully support special fields in Scanner".
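In the meantime, the fallback I am considering is to skip whole fragments using their row counts and only slice within the fragment where the offset lands. This is just a sketch under my assumptions (input_dir, my_column, and the hard-coded offset are placeholders); my understanding is that count_rows() on a Parquet fragment is answered from file metadata, so fully skipped files should not be read or materialized, although the boundary fragment's batches still get decoded:

from pyarrow import dataset as ds

dataset = ds.dataset(input_dir, format="parquet")
rows_to_skip = 42000042  # placeholder offset

for fragment in dataset.get_fragments():
    # For Parquet fragments this should come from file metadata,
    # so a fully skipped file is never decoded.
    fragment_rows = fragment.count_rows()
    if rows_to_skip >= fragment_rows:
        rows_to_skip -= fragment_rows
        continue  # skip the whole file
    for batch in fragment.to_batches(columns=['my_column']):
        if rows_to_skip >= batch.num_rows:
            rows_to_skip -= batch.num_rows
            continue  # batch lies entirely before the offset
        if rows_to_skip > 0:
            # drop the leading rows of the boundary batch
            batch = batch.slice(rows_to_skip)
            rows_to_skip = 0
        # ... process batch ...

It avoids holding the skipped data, but it obviously is not as clean as expressing the skip directly in the scanner's projection and filter, which is what I am trying to do above.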