Tags: pyarrow, apache-arrow

PyArrow usage of __fragment_index and __batch_index


I am working with a huge dataset stored in Parquet files, and I need the ability to skip the first n rows without materializing them in memory. Based on the documentation, I should be able to build a compute expression using the special fields __fragment_index and __batch_index. I am using PyArrow 18.0.

My rough approach (simplified to illustrate the problem):

from pyarrow import dataset as ds
from pyarrow import compute as pc

dataset = ds.dataset(input_dir, format="parquet")

scan_columns = {}
scan_columns['my_column'] = pc.field('my_column')  # arbitrary column with my data
scan_columns['__proj_index'] = pc.add(
    pc.multiply(pc.field('__fragment_index'), pc.scalar(10000)),
    pc.field('__batch_index'),
)
filter = pc.field('__proj_index') >= pc.scalar(42000042)

batches = dataset.scanner(columns=scan_columns, filter=filter).to_batches()

Unfortunately, this generates an error:

File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
File "pyarrow/_dataset.pyx", line 3557, in pyarrow._dataset.Scanner.from_dataset
File "pyarrow/_dataset.pyx", line 3475, in pyarrow._dataset.Scanner._make_scan_options
File "pyarrow/_dataset.pyx", line 3422, in pyarrow._dataset._populate_builder
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(__fragment_index) in column_name: int64

Documentation at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_dataset says:

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

Any insights into what may be wrong here?


Solution

  • This is a known issue, tracked upstream as "Fully support special fields in Scanner": the special fields described in the documentation are not yet resolvable in a scanner's column expressions or filter, which is exactly what produces the ArrowInvalid "No match for FieldRef.Name(__fragment_index)" error above.