
Correct way to specify JSON block size for PyArrow dataset?


Question: when reading newline-delimited JSON files as a PyArrow dataset, what is the correct way to specify the JSON block_size?

Context

I have newline-delimited JSON files which I wish to convert to parquet.

This code worked for one set of files:

import pyarrow.dataset as ds

# ... schema definition

dataset = ds.dataset(
    'data/landing/type_one',
    format='json',
    schema=schema_type_one
)
ds.write_dataset(
    dataset, 
    'data/intermediate/type_one', 
    format='parquet',
    min_rows_per_group=3*512**2,
    max_rows_per_file=5*10**6
)

For another set of files (with a different data structure), my code failed with pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?). Some Googling took me to this GitHub issue, where one of the comments hinted at what my problem was:

We're using the JSON loader of pyarrow. It parses the file chunk by chunk to load the dataset. This issue happens when there's no delimiter in one chunk of data. For json line, the delimiter is the end of line. So with a big value for chunk_size this should have worked unless you have one extremely long line in your file.

And in fact yes, I have some extremely long lines in my second set of files, so I need to increase the block size somehow.
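
To gauge how large the block size needs to be, one rough check is to scan the files for the longest line, since each block the reader processes must contain at least one end-of-line delimiter. This helper is only a sketch; the file path below is a placeholder, not one of my actual files.

# Sketch: measure the longest line so block_size can be set comfortably above it.
max_line_bytes = 0
with open('data/landing/type_two/part-000.json', 'rb') as f:  # placeholder path
    for line in f:
        max_line_bytes = max(max_line_bytes, len(line))
print(f'longest line: {max_line_bytes} bytes')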

pyarrow.dataset.dataset does not offer a way to specify the JSON block size directly, but I was able to get it to work by passing format as a JsonFileFormat object with specific read_options:

import pyarrow.dataset as ds
from pyarrow._dataset import JsonFileFormat
from pyarrow.json import ReadOptions

# ... schema definitions

fileformat = JsonFileFormat(
    read_options=ReadOptions(block_size=5*2**20)  # 5 MiB per block
)

dataset = ds.dataset(
    'data/landing/type_two',
    format=fileformat,
    schema=schema_type_two
)

ds.write_dataset(
    dataset, 
    'data/intermediate/type_two', 
    format='parquet',
    min_rows_per_group=3*512**2,
    max_rows_per_file=5*10**6
)

However, this import makes me uncomfortable: from pyarrow._dataset import JsonFileFormat. Whereas pyarrow.dataset.ParquetFileFormat is documented in the public API, I had to dig through the source to discover pyarrow._dataset.JsonFileFormat, which lives in a separate "private" _dataset module.

Is there a more "correct" way to achieve what I'm trying to accomplish?


Solution

  • It appears that, at the time of this writing, JsonFileFormat is simply missing from the documentation for pyarrow.dataset: the documentation lists CsvFileFormat, IpcFileFormat, ParquetFileFormat and OrcFileFormat, but not JsonFileFormat.

    In fact, however, pyarrow.dataset imports JsonFileFormat from pyarrow._dataset along with the other file formats, so it is available from the public namespace.

    So instead of doing:

    from pyarrow._dataset import JsonFileFormat
    

    one can directly do

    from pyarrow.dataset import JsonFileFormat
    

    The block_size can then be specified by providing an appropriately configured pyarrow.json.ReadOptions object to JsonFileFormat, as shown in the original code in the question.
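
    Putting it together, a minimal sketch of the question's code using only the public import (the path and the schema are placeholders carried over from the question):

    import pyarrow.dataset as ds
    from pyarrow.dataset import JsonFileFormat  # public import instead of pyarrow._dataset
    from pyarrow.json import ReadOptions

    # ... schema definition for schema_type_two

    fileformat = JsonFileFormat(
        read_options=ReadOptions(block_size=5*2**20)  # 5 MiB per block
    )

    dataset = ds.dataset(
        'data/landing/type_two',
        format=fileformat,
        schema=schema_type_two
    )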