In Polars / pandas / PyArrow, I can instantiate an object from a dict, e.g.
In [12]: pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
Out[12]:
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└─────┴─────┘
Is there a way to do that in DuckDB, without going via pandas / pyarrow / etc.?
duckdb
features a function duckdb.read_json
which should do this by simply streaming the dict
as a json string, but no combination of its various parameters will make it read that dict
the same way polars
does unfortunately. You can rearrange the dict
to match the structure expected for a "unstructured" json format in their documentation pretty easily and then use the duckdb.read_json
function to load it the same way as polars
, which I think the closest you can get with the library as it currently stands.
Here is a demonstration which shows the polars
interpretation we expected, the naive duckdb
interpretation we probably didn't and the transformation necessary to load it like polars
did:
import polars
import duckdb
import io
import json
# Silent requirement for fsspec
data = {'a': [1,2,3], 'b': [4,5,6]}
polars_data = polars.DataFrame(data)
print('polars\' interpretation:')
print(polars_data)
duckdb_data = duckdb.read_json(io.StringIO(json.dumps(data)))
print('duckdb \'s naive interpretation:')
print(duckdb_data)
def transform_to_duckjson(dictdata: dict):
return [{fieldname: fieldrecords[recordnum] for fieldname, fieldrecords in dictdata.items()} for recordnum in range(len(dictdata[next(iter(dictdata.keys()))]))]
duckdb_data2 = duckdb.read_json(io.StringIO(json.dumps(transform_to_duckjson(data))))
print('duckdb \'s interpretation after adjustment:')
print(duckdb_data2)
Which gives me the following output:
polars' interpretation:
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└─────┴─────┘
duckdb 's naive interpretation:
┌───────────┬───────────┐
│ a │ b │
│ int64[] │ int64[] │
├───────────┼───────────┤
│ [1, 2, 3] │ [4, 5, 6] │
└───────────┴───────────┘
duckdb 's interpretation after adjustment:
┌───────┬───────┐
│ a │ b │
│ int64 │ int64 │
├───────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
└───────┴───────┘
As a note: Here is the python api reference (scroll down to "read_csv(..." for the relevant information) which doesn't provide any help at all for using the function unfortunately.
Let me know if you have any questions.