Tags: json, parquet, duckdb

How can I cast a string column containing an array of JSON records to a list column in DuckDB?


I am currently using DuckDB to transform data from a Parquet file. In this Parquet file, one column is encoded as a string containing an array of JSON records:

[{...}, {...}, {...}]

I'd like to manipulate this array of records as if it were a list column of JSON values.

I tried reading this column with DuckDB's JSON scalar functions, but unfortunately the column is not parsed correctly, and I cannot set a json_format in a scalar function as I could have done with the read_json_objects function.
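
For comparison, when the same payload sits in a file rather than in a table column, the format can be specified. A minimal sketch (the file name is illustrative, and the parameter is called format in the current DuckDB docs):

-- Read a file whose top level is a JSON array of records,
-- producing one JSON object per row
SELECT * FROM read_json_objects('records.json', format = 'array');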

Is there anything that I can do to solve this problem?


The following is a reproducible example with the input table and the expected output table.

-- Build the input table and populate it with a string
-- column containing an array of JSON records

CREATE TABLE input(data VARCHAR); 
INSERT INTO input VALUES ('[{"id": "1", "status":"ok"}, {"id":"2", "status":"ok"}]');

-- Build the output table and populate it with
-- a JSON column containing a list of records

CREATE TABLE output(data JSON);
INSERT INTO output (data) VALUES (ARRAY['{"id": "1", "status": "ok"}', '{"id": "2", "status": "ok"}']);
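
As a quick sanity check (assuming the json extension is available), the input string itself is valid JSON and already contains the two records:

-- Should return 2: the string parses as a JSON array of two records
SELECT json_array_length(data::JSON) FROM input;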

Solution

  • You need to use some of DuckDB's text functions for your use case. https://duckdb.org/docs/sql/functions/char

    Normally, you can use DuckDB's string_split to separate your VARCHAR into a list of VARCHARs (or JSONs, in your case). In your example, though, the commas inside each JSON record conflict with the commas separating the records. My solution might be a bit hacky, but I hope it works for you: I trim off the leftmost and rightmost brackets, then replace }, with }}, so the separating commas can be told apart, and finally split the VARCHAR into a list of JSONs using }, as the separator.

    duckdb> LOAD 'json';
    duckdb> CREATE TABLE input(data VARCHAR);
    duckdb> INSERT INTO input VALUES ('[{"id": "1", "status":"ok"}, {"id":"2", "status":"ok"}]');
    duckdb> CREATE TABLE output_as_json_list(data JSON[]);
    duckdb> CREATE TABLE output_as_json(data JSON);
    
    duckdb> INSERT INTO output_as_json_list SELECT string_split(replace(data[2:-1], '},', '}},'), '},') FROM input;
    duckdb> SELECT * FROM output_as_json_list;
    ┌──────────────────────────────────────────────────────────┐
    │ data                                                     │
    ╞══════════════════════════════════════════════════════════╡
    │ [{"id": "1", "status":"ok"},  {"id":"2", "status":"ok"}] │
    └──────────────────────────────────────────────────────────┘
    
    duckdb> INSERT INTO output_as_json SELECT UNNEST(data) FROM output_as_json_list;
    duckdb> SELECT * FROM output_as_json;
    ┌────────────────────────────┐
    │ data                       │
    ╞════════════════════════════╡
    │ {"id": "1", "status":"ok"} │
    │  {"id":"2", "status":"ok"} │
    └────────────────────────────┘
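
    Once the records are separate JSON values, the usual JSON operators apply. As a follow-up (assuming the json extension is still loaded), ->> extracts fields as VARCHAR and should return ids 1 and 2, each with status ok:

    duckdb> SELECT data->>'id' AS id, data->>'status' AS status FROM output_as_json;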