python-polars

Polars efficiently extract subset of fields from array of JSON objects (list of structs)


Question: using Polars, how can I more efficiently extract a subset of fields from an array of JSON objects?

Background

I have a large (~ 300GB) jsonlines/ndjson file where each line represents a JSON object which in turn contains nested JSON. I am interested in just a subset of the fields from each JSON object, so my goal is to extract these fields from the ndjson and save as parquet for downstream processing.

I am evaluating two options:

  1. Using Clickhouse (actually chdb)
  2. Using Polars

I've experimented with the Clickhouse approach and have working code that is extremely performant. However, Polars is preferable because it's being used in the rest of my project and my team is already familiar with it, whereas Clickhouse would introduce a new and unfamiliar element.

A simplified view of a single JSON object (i.e. one line of the ndjson, formatted for easier viewing):

{
    "id": "834934509",
    "baseElements":
    [
        {
            "baseElements":
            [
                {
                    "url": "https://acme.com",
                    "owner":
                    {
                        "attributes":
                        [
                            {
                                "type": "name",
                                "display": "Acme Consulting"
                            },
                            {
                                "type": "id",
                                "display": "A345B"
                            }
                        ],
                        "text": "The Acme Consulting Agency London"
                    }
                }
            ],
            "url": "https://baselocation.com/acme"
        },
        {
            "url": "https://baselocation.com/values",
            "valueContent":
            {
                "display": "CUSTOM SPIRIT LIMITED",
                "reference": "Client/529d807b46da995395ad3364cbf37701"
            }
        },
        {
            "url": "https://backuplocation.com/values",
            "valueContent":
            {
                "display": "UNDERMAKER INNOVATION",
                "reference": "Client/08afa4cc57d67bb625d794d60937a770"
            }
        }
    ],
    "standing": "good",
    "deleted": false
}

I want to extract id and standing (pretty straightforward). And from the outer baseElements array I want to get an array with just the url and valueContent fields of the nested objects.
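Concretely, the row I want out of the sample object above would look like this (a hand-written sketch, not program output; the first element's valueContent becomes null because the field is absent there):

```python
# Hypothetical target row after extraction: id and standing kept as-is,
# each baseElements entry reduced to just url and valueContent.
target_row = {
    "id": "834934509",
    "baseElements": [
        {"url": "https://baselocation.com/acme", "valueContent": None},
        {
            "url": "https://baselocation.com/values",
            "valueContent": {
                "display": "CUSTOM SPIRIT LIMITED",
                "reference": "Client/529d807b46da995395ad3364cbf37701",
            },
        },
        {
            "url": "https://backuplocation.com/values",
            "valueContent": {
                "display": "UNDERMAKER INNOVATION",
                "reference": "Client/08afa4cc57d67bb625d794d60937a770",
            },
        },
    ],
    "standing": "good",
}
```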

Here's my code:

import polars as pl

lf_sample = pl.scan_ndjson('data/input/myjson.jsonl').select(
    'id',
    pl.col('baseElements').list.eval(
        pl.struct(
            pl.element().struct.field('url'),
            pl.element().struct.field('valueContent'),
        )
    ).alias('baseElements'),
    'standing',
)

lf_sample.collect(streaming=True).write_parquet('data/output/myparquet.parquet',
    compression='snappy',
    use_pyarrow=True
)

Functionally, this works as I expect it to.

However, it is significantly slower than the Clickhouse approach. Some stats below using a subset of the full dataset, run on an AWS r6g.2xlarge EC2 instance (8 vCPUs, 64GB RAM) running Amazon Linux 2023:

Sample size (#rows) | Clickhouse | Polars
--------------------|------------|-------
10k                 | 0.37s      | 4.67s
20k                 | 0.52s      | 18.74s

The complete file is 130 million rows/records. Processing it with Clickhouse takes ~40 minutes; given the numbers above, I have not even attempted the full run with Polars.

If I simplify the Polars code to extract the entire baseElements array as-is, the speed is comparable to the Clickhouse approach, so I suspect it's the pl.col('baseElements').list.eval(...) that introduces the inefficiency. I tried passing parallel=True to list.eval(...), but that made it run even longer.

So, is there a better way to do what I'm attempting with Polars, or should I just resign myself to using Clickhouse?

For reference, here's the Clickhouse approach:

from chdb import session as chs
sess = chs.Session()

query = """
SET flatten_nested=0;
select 
    id,
    baseElements,
    standing,
from file('data/input/myjson.jsonl', 
    'JSONEachRow',
    'id String,
    baseElements Nested(
        url Nullable(String),
        valueContent Tuple(reference Nullable(String))
    ),
    standing String
    '
)
into outfile 'data/output/myparquet.parquet'
truncate
format Parquet
settings output_format_parquet_compression_method='snappy', output_format_parquet_string_as_string=1;
"""
sess.query(query)

Solution

  • I think the issue is that the Clickhouse approach defines a schema, which instructs the parser to discard everything else at parse time.

    The Polars approach, by contrast, parses the full structure and then processes the result: looping through lists, creating new structs, etc.

    You can pass a schema= to scan_ndjson instead:

    pl.scan_ndjson(
        'data/input/myjson.jsonl',
        schema = {
            "id": pl.String,
            "baseElements": pl.List(
                pl.Struct({
                    "url": pl.String, 
                    "valueContent": pl.Struct({"display": pl.String, "reference": pl.String})
                })
            ),
            "standing": pl.String
        }
    )