I have some pyarrow Parquet dataset writing code. I want to have an integration test that ensures the file is written correctly. I'd like to do that by writing a small example data chunk to an in-memory filesystem. However, I'm struggling to find a pyarrow-compatible in-memory filesystem interface for Python.
Below is a snippet of code that uses a filesystem variable. I'd like to replace that filesystem with an in-memory filesystem that I can later inspect programmatically in my integration tests.
import pyarrow.parquet as pq

pq.write_to_dataset(
    score_table,
    root_path=AWS_ZEBRA_OUTPUT_S3_PREFIX,
    filesystem=filesystem,
    partition_cols=[
        EQF_SNAPSHOT_YEAR_PARTITION,
        EQF_SNAPSHOT_MONTH_PARTITION,
        EQF_SNAPSHOT_DAY_PARTITION,
        ZEBRA_COMPUTATION_TIMESTAMP,
    ],
)
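
On recent pyarrow versions there is a route that avoids a hand-rolled class: fsspec ships an in-memory filesystem, and pyarrow's documented PyFileSystem/FSSpecHandler adapter wraps any fsspec filesystem so that it passes pyarrow's type checks. A minimal sketch, assuming pyarrow >= 2.0 and fsspec are installed; the table and partition column below are hypothetical stand-ins for score_table and the constants above:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.fs import PyFileSystem, FSSpecHandler
from fsspec.implementations.memory import MemoryFileSystem

# Hypothetical stand-in for the real score_table.
score_table = pa.table({"score": [0.1, 0.2], "snapshot_year": [2023, 2024]})

mem_fs = MemoryFileSystem()                       # fsspec's in-memory filesystem
filesystem = PyFileSystem(FSSpecHandler(mem_fs))  # wrap it as a pyarrow filesystem

pq.write_to_dataset(
    score_table,
    root_path="/zebra-output",
    filesystem=filesystem,
    partition_cols=["snapshot_year"],
)

# The integration test can now inspect the result programmatically:
assert mem_fs.ls("/zebra-output")  # partition directories were created
round_tripped = pq.read_table("/zebra-output", filesystem=filesystem)
assert round_tripped.num_rows == score_table.num_rows

Note that this relies on the newer pyarrow.fs API; the legacy code path linked below predates it and rejects wrapper objects it does not recognize.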
In the end, I manually implemented a subclass of the pyarrow.FileSystem ABC. It seems that using mock for testing fails, as pyarrow (not in the most Pythonic way) checks the type of the filesystem parameter passed to write_to_dataset: https://github.com/apache/arrow/blob/5e201fed061f2a95e66889fa527ae8ef547e9618/python/pyarrow/filesystem.py#L383. I suggest changing the logic in this method to not check types explicitly (even isinstance would be preferable!) to allow for easier testing.
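
For illustration, a sketch of what such a manual implementation can look like: a subclass of the legacy pyarrow.filesystem.FileSystem ABC that keeps written bytes in a dict. The names here are hypothetical, only the methods the legacy write_to_dataset path appears to touch are implemented, and the legacy filesystem module has since been deprecated in favour of pyarrow.fs, so treat this as an illustration rather than a drop-in:

import io
from pyarrow.filesystem import FileSystem  # legacy ABC, only on older pyarrow


class _CapturingBuffer(io.BytesIO):
    # BytesIO that copies its contents into the owning store when closed,
    # so the bytes survive pyarrow closing the stream after writing.
    def __init__(self, store, path):
        super().__init__()
        self._store = store
        self._path = path

    def close(self):
        self._store[self._path] = self.getvalue()
        super().close()


class InMemoryFileSystem(FileSystem):
    # Keeps every written file as bytes in a dict keyed by path.
    def __init__(self):
        self.files = {}    # path -> bytes
        self.dirs = set()  # directories created via mkdir

    def _isfilestore(self):
        # The legacy writer consults this before creating directories.
        return True

    def exists(self, path):
        return path in self.files or path in self.dirs

    def isdir(self, path):
        return path in self.dirs

    def isfile(self, path):
        return path in self.files

    def mkdir(self, path, create_parents=True):
        self.dirs.add(path)

    def open(self, path, mode="rb"):
        if "w" in mode:
            return _CapturingBuffer(self.files, path)
        return io.BytesIO(self.files[path])

An integration test can then pass an InMemoryFileSystem instance as the filesystem argument and assert on its .files dict, e.g. that the expected partition paths appear as keys.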