python, python-polars

polars equivalent of pandas set_index() to_dict


I have a polars dataframe:

import polars as pl
df = pl.DataFrame({'index': [1,2,3,2,1],
                   'object': [1, 1, 1, 2, 2],
                   'period': [1, 2, 4, 4, 23],
                   'value': [24, 67, 89, 5, 23]})

How do I do the following in polars? It is easy enough in pandas:

In [2]: df.to_pandas().groupby("index").last().transpose().to_dict()
Out[2]: 
{1: {'object': 2, 'period': 23, 'value': 23},
 2: {'object': 2, 'period': 4, 'value': 5},
 3: {'object': 1, 'period': 4, 'value': 89}}

Solution

  • The Algorithm

    Polars does not have the concept of an index, but we can reach the same result with partition_by.

    {
        index[0]: frame.select(pl.exclude('index')).to_dicts()[0]
        for index, frame in
            (
                df
                .unique(subset=['index'], keep='last')
                .partition_by(by=["index"],
                              as_dict=True,
                              maintain_order=True)
            ).items()
    }
    
    
    {1: {'object': 2, 'period': 23, 'value': 23},
    2: {'object': 2, 'period': 4, 'value': 5},
    3: {'object': 1, 'period': 4, 'value': 89}}
    

    In steps

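    The heart of the algorithm is partition_by with as_dict=True, applied after unique(subset=['index'], keep='last') has reduced the frame to the last row for each index value.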

    (
        df
        .unique(subset=['index'], keep='last')
        .partition_by(by=["index"],
                      as_dict=True,
                      maintain_order=True)
    )
    
    {(1,): shape: (1, 4)
    ┌───────┬────────┬────────┬───────┐
    │ index ┆ object ┆ period ┆ value │
    │ ---   ┆ ---    ┆ ---    ┆ ---   │
    │ i64   ┆ i64    ┆ i64    ┆ i64   │
    ╞═══════╪════════╪════════╪═══════╡
    │ 1     ┆ 2      ┆ 23     ┆ 23    │
    └───────┴────────┴────────┴───────┘,
    (2,): shape: (1, 4)
    ┌───────┬────────┬────────┬───────┐
    │ index ┆ object ┆ period ┆ value │
    │ ---   ┆ ---    ┆ ---    ┆ ---   │
    │ i64   ┆ i64    ┆ i64    ┆ i64   │
    ╞═══════╪════════╪════════╪═══════╡
    │ 2     ┆ 2      ┆ 4      ┆ 5     │
    └───────┴────────┴────────┴───────┘,
    (3,): shape: (1, 4)
    ┌───────┬────────┬────────┬───────┐
    │ index ┆ object ┆ period ┆ value │
    │ ---   ┆ ---    ┆ ---    ┆ ---   │
    │ i64   ┆ i64    ┆ i64    ┆ i64   │
    ╞═══════╪════════╪════════╪═══════╡
    │ 3     ┆ 1      ┆ 4      ┆ 89    │
    └───────┴────────┴────────┴───────┘}
    

    This creates a dictionary whose keys are the index values (as one-element tuples) and whose values are the one-row sub-dataframes associated with each index.

    Using this dictionary, we can then build the nested dictionaries with a Python dictionary comprehension:

    {
        index[0]: frame.to_dicts()
        for index, frame in
            (
                df
                .unique(subset=['index'], keep='last')
                .partition_by(by=["index"],
                              as_dict=True,
                              maintain_order=True)
            ).items()
    }
    
    {1: [{'index': 1, 'object': 2, 'period': 23, 'value': 23}],
    2: [{'index': 2, 'object': 2, 'period': 4, 'value': 5}],
    3: [{'index': 3, 'object': 1, 'period': 4, 'value': 89}]}
    

    All that is left is to tidy the output: drop index from the nested dictionaries and unwrap the single-element list.

    {
        index[0]: frame.select(pl.exclude('index')).to_dicts()[0]
        for index, frame in
            (
                df
                .unique(subset=['index'], keep='last')
                .partition_by(by=["index"],
                              as_dict=True,
                              maintain_order=True)
            ).items()
    }
    
    {1: {'object': 2, 'period': 23, 'value': 23},
    2: {'object': 2, 'period': 4, 'value': 5},
    3: {'object': 1, 'period': 4, 'value': 89}}