I am currently using DuckDB to perform data transformations with a Parquet file as the source. In this Parquet file, I have one column encoded as a string which contains an array of JSON records:
[{...}, {...}, {...}]
I'd like to manipulate this array of records as if it were a list column of JSON values.
I tried reading this column with DuckDB's JSON scalar function, but unfortunately the column is not parsed correctly, and I cannot set a json_format in the scalar function as I could have done with the read_json_objects function.
Is there anything that I can do to solve this problem?
The following is a reproducible example with the input table and what should be the output table.
-- Build the table and populate it with a string
-- column with an array of json records
CREATE TABLE input(data VARCHAR);
INSERT INTO input VALUES ('[{"id": "1", "status":"ok"}, {"id":"2", "status":"ok"}]');
-- Build the output table and populate it with
-- a JSON column containing a list
CREATE TABLE output(data JSON);
INSERT INTO output (data) VALUES (ARRAY['{"id": "1", "status": "ok"}', '{"id": "2", "status": "ok"}']);
You need to use some of DuckDB's text functions for your use case: https://duckdb.org/docs/sql/functions/char
Normally, you can use DuckDB's string_split to separate your VARCHAR into a list of VARCHARs (or JSONs in your case). In your example, however, the commas inside the JSON objects conflict with the comma that separates the objects. My solution might be a bit hacky, but I hope it works for you: I trim off the leftmost and rightmost brackets, then replace '},' with '}},' so the comma separator is unambiguous, and finally split the VARCHAR into a list of JSONs using '},' as the separator.
duckdb> LOAD 'json';
duckdb> CREATE TABLE input(data VARCHAR);
duckdb> INSERT INTO input VALUES ('[{"id": "1", "status":"ok"}, {"id":"2", "status":"ok"}]');
duckdb> CREATE TABLE output_as_json_list(data JSON[]);
duckdb> CREATE TABLE output_as_json(data JSON);
duckdb> INSERT INTO output_as_json_list SELECT string_split(replace(data[2:-1], '},', '}},'), '},') FROM input;
duckdb> SELECT * FROM output_as_json_list;
┌──────────────────────────────────────────────────────────┐
│ data │
╞══════════════════════════════════════════════════════════╡
│ [{"id": "1", "status":"ok"}, {"id":"2", "status":"ok"}] │
└──────────────────────────────────────────────────────────┘
duckdb> INSERT INTO output_as_json SELECT UNNEST(data) FROM output_as_json_list;
duckdb> SELECT * FROM output_as_json;
┌────────────────────────────┐
│ data │
╞════════════════════════════╡
│ {"id": "1", "status":"ok"} │
│ {"id":"2", "status":"ok"} │
└────────────────────────────┘
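If your DuckDB version is recent enough, a less fragile alternative is to let the JSON extension do the parsing instead of splitting strings by hand. If I remember the API correctly, from_json (an alias of json_transform) can convert a JSON array into a LIST given a structure argument, which you can then UNNEST into rows. Treat this as a sketch rather than a tested solution, since the exact structure syntax may vary between versions:

duckdb> LOAD 'json';
duckdb> -- Parse the VARCHAR as a JSON array, transform it into
duckdb> -- a list of JSON values, and unnest the list into rows.
duckdb> INSERT INTO output_as_json SELECT UNNEST(from_json(data, '["JSON"]')) FROM input;

If from_json is not available to you, another route is json_extract with computed paths (e.g. '$[0]', '$[1]', ...) driven by json_array_length, though that requires generating the indices yourself.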