
Using in-memory filesystem in `pyarrow` tests


I have some code that writes a Parquet dataset with pyarrow. I want an integration test that ensures the file is written correctly, and I'd like to do that by writing a small example data chunk to an in-memory filesystem. However, I'm struggling to find a pyarrow-compatible in-memory filesystem interface for Python.

Below is the snippet of code in question. I'd like to replace the `filesystem` variable with an in-memory filesystem that I can later inspect programmatically in my integration tests.

import pyarrow.parquet as pq

pq.write_to_dataset(
    score_table,
    root_path=AWS_ZEBRA_OUTPUT_S3_PREFIX,
    filesystem=filesystem,
    partition_cols=[
        EQF_SNAPSHOT_YEAR_PARTITION,
        EQF_SNAPSHOT_MONTH_PARTITION,
        EQF_SNAPSHOT_DAY_PARTITION,
        ZEBRA_COMPUTATION_TIMESTAMP,
    ],
)

Solution

  • In the end, I manually implemented an instance of the legacy pyarrow `FileSystem` ABC (`pyarrow.filesystem.FileSystem`); see the sketch below. Using a mock for testing fails, because pyarrow (not in the most Pythonic way) checks the concrete type of the `filesystem` argument passed to `write_to_dataset`: https://github.com/apache/arrow/blob/5e201fed061f2a95e66889fa527ae8ef547e9618/python/pyarrow/filesystem.py#L383. I'd suggest changing the logic in this method so that it doesn't check types explicitly (even an `isinstance` check would be preferable!) to allow for easier testing.
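
For reference, here is a minimal sketch of what such a hand-rolled in-memory filesystem can look like. It assumes a legacy pyarrow version in which `pyarrow.filesystem.FileSystem` still exists (the legacy filesystem API was deprecated in favour of `pyarrow.fs` and removed in later releases), and the names `InMemoryFileSystem` and `_CapturingBytesIO` are mine, not pyarrow's. The exact set of methods `write_to_dataset` exercises may differ between versions, so a few more might need implementing.

import io

import pyarrow.filesystem


class _CapturingBytesIO(io.BytesIO):
    # BytesIO that copies its contents into `files` on close, since
    # write_to_dataset closes each file handle after writing a fragment.
    def __init__(self, files, path):
        super().__init__()
        self._files = files
        self._path = path

    def close(self):
        self._files[self._path] = self.getvalue()
        super().close()


class InMemoryFileSystem(pyarrow.filesystem.FileSystem):
    # In-memory stand-in for the legacy FileSystem ABC, so tests can
    # inspect what write_to_dataset produced without touching disk or S3.
    def __init__(self):
        self.files = {}   # path -> bytes of a written Parquet fragment
        self.dirs = set()

    def _isfilestore(self):
        # write_to_dataset only creates directories on "file store"
        # filesystems, so report ourselves as one.
        return True

    def exists(self, path):
        return path in self.files or path in self.dirs

    def isdir(self, path):
        return path in self.dirs

    def isfile(self, path):
        return path in self.files

    def mkdir(self, path, create_parents=True):
        self.dirs.add(path)

    def open(self, path, mode='rb'):
        if 'w' in mode:
            return _CapturingBytesIO(self.files, path)
        return io.BytesIO(self.files[path])

A test can then pass an instance as the `filesystem` argument and assert on the captured output:

filesystem = InMemoryFileSystem()

pq.write_to_dataset(
    score_table,
    root_path=AWS_ZEBRA_OUTPUT_S3_PREFIX,
    filesystem=filesystem,
    partition_cols=[
        EQF_SNAPSHOT_YEAR_PARTITION,
        EQF_SNAPSHOT_MONTH_PARTITION,
        EQF_SNAPSHOT_DAY_PARTITION,
        ZEBRA_COMPUTATION_TIMESTAMP,
    ],
)

# The keys of filesystem.files encode the Hive-style partition layout,
# and each value is a complete Parquet file that can be read back.
assert filesystem.files
fragment = next(iter(filesystem.files.values()))
round_tripped = pq.read_table(io.BytesIO(fragment))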