Question: when reading newline-delimited JSON files as a PyArrow dataset, what is the correct way to specify the JSON block_size?
Context
I have newline-delimited JSON files which I wish to convert to parquet.
This code worked for one set of files:
# ... imports, schema definition
dataset = ds.dataset(
    'data/landing/type_one',
    format='json',
    schema=schema_type_one
)
ds.write_dataset(
    dataset,
    'data/intermediate/type_one',
    format='parquet',
    min_rows_per_group=3*512**2,
    max_rows_per_file=5*10**6
)
For another set of files (with a different data structure), my code was failing with pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?). Some Googling took me to this GitHub issue, where one of the comments hinted at what my problem was:
We're using the JSON loader of pyarrow. It parses the file chunk by chunk to load the dataset. This issue happens when there's no delimiter in one chunk of data. For json line, the delimiter is the end of line. So with a big value for chunk_size this should have worked unless you have one extremely long line in your file.
And in fact yes, I have some extremely long lines in my second set of files, so I need to increase the block size somehow.
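As a rough check of how large the block size needs to be, one can measure the longest line in the offending files (an illustrative sketch, not part of the original workflow; the file name is made up):
# Rough check: block_size must comfortably exceed the longest line,
# since each newline-delimited record has to fit inside a single block.
max_line = 0
with open('data/landing/type_two/example.json', 'rb') as f:  # hypothetical file name
    for line in f:
        max_line = max(max_line, len(line))
print(f"Longest line: {max_line} bytes")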
pyarrow.dataset.dataset does not have a way to directly specify the block size for JSON, but I was able to get it to work by specifying format as a JsonFileFormat object with specific read_options:
import pyarrow.dataset as ds
from pyarrow._dataset import JsonFileFormat
from pyarrow.json import ReadOptions

# ... schema definitions
fileformat = JsonFileFormat(
    read_options=ReadOptions(block_size=5*2**20)
)
dataset = ds.dataset(
    'data/landing/type_two',
    format=fileformat,
    schema=schema_type_two
)
ds.write_dataset(
    dataset,
    'data/intermediate/type_two',
    format='parquet',
    min_rows_per_group=3*512**2,
    max_rows_per_file=5*10**6
)
However, this import makes me uncomfortable: from pyarrow._dataset import JsonFileFormat. Whereas pyarrow.dataset.ParquetFileFormat is documented in the public API, I had to dig through the code to find the existence of pyarrow._dataset.JsonFileFormat, which resides in a separate "private" _dataset package.
Is there a more "correct" way to achieve what I'm trying to accomplish?
It appears that, at the time of this writing, JsonFileFormat is simply missing from the documentation for pyarrow.dataset. The documentation lists CsvFileFormat, IpcFileFormat, ParquetFileFormat and OrcFileFormat, but not JsonFileFormat.
In fact, however, pyarrow.dataset imports JsonFileFormat from pyarrow._dataset along with the other file formats.
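This is easy to verify interactively (a minimal check; assumes a pyarrow version recent enough to include JSON dataset support):
import pyarrow.dataset as ds

# JsonFileFormat is re-exported by pyarrow.dataset even though the docs omit it.
print(ds.JsonFileFormat)
print(issubclass(ds.JsonFileFormat, ds.FileFormat))  # True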
So instead of:
from pyarrow._dataset import JsonFileFormat
one can directly do:
from pyarrow.dataset import JsonFileFormat
The block_size can then be specified by passing an appropriately configured pyarrow.json.ReadOptions object to JsonFileFormat, as shown in the original code in the question.
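Putting it together, the working code from the question only needs its import changed (a sketch under the same schema and path assumptions as above):
import pyarrow.dataset as ds
from pyarrow.dataset import JsonFileFormat  # public re-export, no _dataset needed
from pyarrow.json import ReadOptions

# ... schema definitions

# Larger block_size so that every (very long) JSON line fits inside one block.
fileformat = JsonFileFormat(
    read_options=ReadOptions(block_size=5*2**20)  # 5 MiB
)

dataset = ds.dataset(
    'data/landing/type_two',
    format=fileformat,
    schema=schema_type_two
)
ds.write_dataset(
    dataset,
    'data/intermediate/type_two',
    format='parquet',
    min_rows_per_group=3*512**2,
    max_rows_per_file=5*10**6
)
Any block_size comfortably larger than the longest line should work; 5 MiB is simply the value used in the question.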