Search code examples
pythonduckdb

DuckDBPyRelation from Python dict?


In Polars / pandas / PyArrow, I can instantiate an object from a dict, e.g.

In [12]: pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
Out[12]:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Is there a way to do that in DuckDB, without going via pandas / pyarrow / etc.?


Solution

  • duckdb features a function duckdb.read_json which should do this by simply streaming the dict as a json string, but no combination of its various parameters will make it read that dict the same way polars does unfortunately. You can rearrange the dict to match the structure expected for a "unstructured" json format in their documentation pretty easily and then use the duckdb.read_json function to load it the same way as polars, which I think the closest you can get with the library as it currently stands.

    Here is a demonstration which shows the polars interpretation we expected, the naive duckdb interpretation we probably didn't and the transformation necessary to load it like polars did:

    import polars
    import duckdb
    import io
    import json
    # Silent requirement for fsspec
    
    data = {'a': [1,2,3], 'b': [4,5,6]}
    polars_data = polars.DataFrame(data)
    print('polars\' interpretation:')
    print(polars_data)
    
    duckdb_data = duckdb.read_json(io.StringIO(json.dumps(data)))
    print('duckdb \'s naive interpretation:')
    print(duckdb_data)
    
    def transform_to_duckjson(dictdata: dict):
        return [{fieldname: fieldrecords[recordnum] for fieldname, fieldrecords in dictdata.items()} for recordnum in range(len(dictdata[next(iter(dictdata.keys()))]))]
    
    duckdb_data2 = duckdb.read_json(io.StringIO(json.dumps(transform_to_duckjson(data))))
    print('duckdb \'s interpretation after adjustment:')
    print(duckdb_data2)
    

    Which gives me the following output:

    polars' interpretation:
    shape: (3, 2)
    ┌─────┬─────┐
    │ a   ┆ b   │
    │ --- ┆ --- │
    │ i64 ┆ i64 │
    ╞═════╪═════╡
    │ 1   ┆ 4   │
    │ 2   ┆ 5   │
    │ 3   ┆ 6   │
    └─────┴─────┘
    duckdb 's naive interpretation:
    ┌───────────┬───────────┐
    │     a     │     b     │
    │  int64[]  │  int64[]  │
    ├───────────┼───────────┤
    │ [1, 2, 3] │ [4, 5, 6] │
    └───────────┴───────────┘
    
    duckdb 's interpretation after adjustment:
    ┌───────┬───────┐
    │   a   │   b   │
    │ int64 │ int64 │
    ├───────┼───────┤
    │     1 │     4 │
    │     2 │     5 │
    │     3 │     6 │
    └───────┴───────┘
    
    

    As a note: Here is the python api reference (scroll down to "read_csv(..." for the relevant information) which doesn't provide any help at all for using the function unfortunately.

    Let me know if you have any questions.