Tags: parquet, partitioning, pyarrow

Is it possible to specify the compression when using pyarrow write_dataset?


I would like to be able to control the type of compression used when partitioning (the default is snappy).

import numpy.random
import pyarrow as pa
import pyarrow.dataset as ds

data = pa.table(
    {
        "day": numpy.random.randint(1, 31, size=100),
        "month": numpy.random.randint(1, 12, size=100),
        "year": [2000 + x // 10 for x in range(100)],
    }
)


ds.write_dataset(
    data,
    "./tmp/partitioned",
    format="parquet",
    existing_data_behavior="delete_matching",
    partitioning=ds.partitioning(
        pa.schema(
            [
                ("year", pa.int16()),
            ]
        ),
    ),
)


It is not clear to me from the docs whether that's actually possible.


Solution

  • Yes, you can control the compression through the file_options argument of write_dataset. From the docs:

    file_options

    pyarrow.dataset.FileWriteOptions, optional

    FileFormat specific write options, created using the FileFormat.make_write_options() function.

    You can use any of the compression codecs mentioned in the docs: snappy, gzip, brotli, zstd, lz4, or none.

    The code below writes the dataset using brotli compression.

    import numpy.random
    import pyarrow as pa
    import pyarrow.dataset as ds
    
    data = pa.table(
        {
            "day": numpy.random.randint(1, 31, size=100),
            "month": numpy.random.randint(1, 12, size=100),
            "year": [2000 + x // 10 for x in range(100)],
        }
    )
    
    
    # Parquet-specific write options with brotli compression
    file_options = ds.ParquetFileFormat().make_write_options(compression='brotli')
    
    # Write the dataset partitioned by year, using the brotli options
    ds.write_dataset(
        data,
        "./tmp/partitioned",
        format="parquet",
        existing_data_behavior="delete_matching",
        file_options=file_options,
        partitioning=ds.partitioning(
            pa.schema(
                [
                    ("year", pa.int16()),
                ]
            ),
        ),
    )
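
    If you want a different codec or an explicit compression level, the same make_write_options() call takes those parameters. A minimal sketch, assuming compression_level is forwarded to the Parquet writer the same way it is for pyarrow.parquet.ParquetWriter (the ./tmp/partitioned_zstd path is just an illustrative output directory; data, pa and ds are reused from the snippet above):

    # Assumption: compression_level is passed through to the Parquet writer
    zstd_options = ds.ParquetFileFormat().make_write_options(
        compression="zstd",
        compression_level=9,
    )
    
    ds.write_dataset(
        data,
        "./tmp/partitioned_zstd",  # hypothetical output path for this sketch
        format="parquet",
        existing_data_behavior="delete_matching",
        file_options=zstd_options,
        partitioning=ds.partitioning(pa.schema([("year", pa.int16())])),
    )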
    

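    To double-check which codec actually ended up on disk, you can read back the Parquet metadata of one of the written files. A quick sketch, assuming the default part-{i}.parquet file naming used by write_dataset:

    import glob
    import pyarrow.parquet as pq
    
    # Grab any file produced by write_dataset and inspect its first column chunk
    files = glob.glob("./tmp/partitioned/**/*.parquet", recursive=True)
    meta = pq.ParquetFile(files[0]).metadata
    print(meta.row_group(0).column(0).compression)  # prints e.g. BROTLI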