Question: using Polars, how can I more efficiently extract a subset of fields from an array of JSON objects?
Background
I have a large (~ 300GB) jsonlines/ndjson file where each line represents a JSON object which in turn contains nested JSON. I am interested in just a subset of the fields from each JSON object, so my goal is to extract these fields from the ndjson and save as parquet for downstream processing.
I am evaluating two options: Polars, and Clickhouse via the chdb Python bindings.
I've experimented with the Clickhouse approach and have working code that is extremely performant. However, Polars is preferable because it's being used in the rest of my project and my team is already familiar with it, whereas Clickhouse would introduce a new and unfamiliar element.
A simplified view of a single JSON object (i.e. one line of the ndjson, formatted for easier viewing):
{
"id": "834934509",
"baseElements":
[
{
"baseElements":
[
{
"url": "https://acme.com",
"owner":
{
"attributes":
[
{
"type": "name",
"display": "Acme Consulting",
},
{
"type": "id",
"display": "A345B"
}
],
"text": "The Acme Consulting Agency London"
}
}
],
"url": "https://baselocation.com/acme"
},
{
"url": "https://baselocation.com/values",
"valueContent":
{
"display": "CUSTOM SPIRIT LIMITED",
"reference": "Client/529d807b46da995395ad3364cbf37701"
}
},
{
"url": "https://backuplocation.com/values",
"valueContent":
{
"display": "UNDERMAKER INNOVATION",
"reference": "Client/08afa4cc57d67bb625d794d60937a770"
}
}
],
"standing": "good",
"deleted": false
}
I want to extract id and standing (pretty straightforward). And from the outer baseElements array I want to get an array containing just the url and valueContent fields of the nested objects.
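In plain Python terms, the target shape for a single record looks like this (a hypothetical extract helper, not part of the pipeline, just to pin down what should be kept):

```python
# Hypothetical helper: given one parsed ndjson record, keep only the
# fields of interest; fields absent from an element come back as None.
def extract(record: dict) -> dict:
    return {
        "id": record["id"],
        "baseElements": [
            {"url": el.get("url"), "valueContent": el.get("valueContent")}
            for el in record["baseElements"]
        ],
        "standing": record["standing"],
    }
```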
Here's my code:
import polars as pl

lf_sample = pl.scan_ndjson('data/input/myjson.jsonl').select(
    'id',
    pl.col('baseElements').list.eval(
        pl.struct(
            pl.element().struct.field('url'),
            pl.element().struct.field('valueContent')
        )
    ).alias('baseElements'),
    'standing'
)

lf_sample.collect(streaming=True).write_parquet(
    'data/output/myparquet.parquet',
    compression='snappy',
    use_pyarrow=True
)
Functionally, this works as I expect it to.
However, it is significantly slower than the Clickhouse approach. Some stats below, using a subset of the full dataset, run on an AWS r6g.2xlarge EC2 instance (8 vCPUs, 64GB RAM) running Amazon Linux 2023:
| Sample size (#rows) | Clickhouse | Polars |
| --- | --- | --- |
| 10k | 0.37s | 4.67s |
| 20k | 0.52s | 18.74s |
The complete file is 130 million rows/records. Processing with Clickhouse takes ~40 minutes; given the numbers above, I have not even attempted it with Polars.
If I simplify the Polars code to extract the entire baseElements array as-is, the speed is comparable to the Clickhouse approach, so I figure it's the pl.col('baseElements').list.eval(...) that's introducing the inefficiency. I tried parallel=True on list.eval(...), but that seems to make it run even longer.
So, is there a better way to do what I'm attempting with Polars, or should I just resign myself to using Clickhouse?
For reference, here's the Clickhouse approach:
from chdb import session as chs
sess = chs.Session()
query = """
SET flatten_nested=0;
select
id,
baseElements,
standing,
from file('data/input/myjson.jsonl',
'JSONEachRow',
'id String,
baseElements Nested(
url Nullable(String),
valueContent Tuple(reference Nullable(String))
),
standing String
'
)
into outfile 'data/output/myparquet.parquet'
truncate
format Parquet
settings output_format_parquet_compression_method='snappy', output_format_parquet_string_as_string=1;
"""
sess.query(query)
I think the issue is that the Clickhouse approach defines a schema which instructs the parser to discard everything else, whereas your Polars approach parses the full structure and then post-processes it: looping through lists, creating new structs, etc. You can pass a schema= to scan_ndjson instead:
pl.scan_ndjson(
'data/input/myjson.jsonl',
schema = {
"id": pl.String,
"baseElements": pl.List(
pl.Struct({
"url": pl.String,
"valueContent": pl.Struct({"display": pl.String, "reference": pl.String})
})
),
"standing": pl.String
}
)