Tags: parquet, partitioning, pyarrow

Is it possible to specify the compression when using pyarrow write_dataset?


I would like to be able to control the type of compression used when partitioning (the default is snappy).

import numpy.random
import pyarrow as pa
import pyarrow.dataset as ds

data = pa.table(
    {
        "day": numpy.random.randint(1, 31, size=100),
        "month": numpy.random.randint(1, 12, size=100),
        "year": [2000 + x // 10 for x in range(100)],
    }
)


ds.write_dataset(
    data,
    "./tmp/partitioned",
    format="parquet",
    existing_data_behavior="delete_matching",
    partitioning=ds.partitioning(
        pa.schema(
            [
                ("year", pa.int16()),
            ]
        ),
    ),
)


It is not clear to me from the docs whether that's actually possible.


Solution

  • Yes, you can control the compression through the file_options argument of write_dataset. From the docs:

    file_options

    pyarrow.dataset.FileWriteOptions, optional

    FileFormat specific write options, created using the FileFormat.make_write_options() function.

    You can use any of the compression codecs mentioned in the docs: snappy, gzip, brotli, zstd, lz4, or none.

    The code below writes the dataset using brotli compression.

    import numpy.random
    import pyarrow as pa
    import pyarrow.dataset as ds
    
    data = pa.table(
        {
            "day": numpy.random.randint(1, 31, size=100),
            "month": numpy.random.randint(1, 12, size=100),
            "year": [2000 + x // 10 for x in range(100)],
        }
    )
    
    
    # Parquet-specific write options with brotli compression
    file_options = ds.ParquetFileFormat().make_write_options(compression='brotli')
    
    # Write the dataset partitioned by year, using the brotli options
    ds.write_dataset(
        data,
        "./tmp/partitioned",
        format="parquet",
        existing_data_behavior="delete_matching",
        file_options=file_options,
        partitioning=ds.partitioning(
            pa.schema(
                [
                    ("year", pa.int16()),
                ]
            ),
        ),
    )
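
    If you want a different codec or an explicit compression level, the same make_write_options() call takes those parameters. A minimal sketch, assuming compression_level is forwarded to the Parquet writer the same way it is for pyarrow.parquet.ParquetWriter (the ./tmp/partitioned_zstd path is just an illustrative output directory; data, pa and ds are reused from the snippet above):

    # Assumption: compression_level is passed through to the Parquet writer
    zstd_options = ds.ParquetFileFormat().make_write_options(
        compression="zstd",
        compression_level=9,
    )
    
    ds.write_dataset(
        data,
        "./tmp/partitioned_zstd",  # hypothetical output path for this sketch
        format="parquet",
        existing_data_behavior="delete_matching",
        file_options=zstd_options,
        partitioning=ds.partitioning(pa.schema([("year", pa.int16())])),
    )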
    

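    To double-check which codec actually ended up on disk, you can read back the Parquet metadata of one of the written files. A quick sketch, assuming the default part-{i}.parquet file naming used by write_dataset:

    import glob
    import pyarrow.parquet as pq
    
    # Grab any file produced by write_dataset and inspect its first column chunk
    files = glob.glob("./tmp/partitioned/**/*.parquet", recursive=True)
    meta = pq.ParquetFile(files[0]).metadata
    print(meta.row_group(0).column(0).compression)  # prints e.g. BROTLI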