python pandas dataframe multi-index

Create multiindex Dataframe with column containing a list


I have a pandas DataFrame that looks like this; column E contains a list:

A  B  C  D  E
1  1  1  1  [obj1, obj2, obj3]
1  1  2  1  []
1  1  2  1  [obj1]
1  2  2  1  [obj1, obj2]
2  1  1  1  []
2  1  2  1  [obj1, obj2, obj3]

I want to turn it into a MultiIndex DataFrame (or any other kind of DataFrame) that looks like this: one row for each value in the list, but without duplicating the values in A-D.

A  B  C  D  E
1  1  1  1  obj1
            obj2
            obj3
1  1  2  1
1  1  2  1  obj1
1  2  2  1  obj1
            obj2
2  1  1  1
2  1  2  1  obj1
            obj2
            obj3

But I cannot figure out how to do it.

I tried using

df.explode(['E'], ignore_index=True)

This flattens the list but duplicates the values in columns A-D, so when I count the values I also count the duplicates.

Any ideas?


Solution

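  • You can explode the data and reset the index, then compare the resulting index column with a shifted copy of itself. After explode, all rows that come from the same list share the same original index, so this comparison gives a mask that marks every repeated row. Wherever the mask is true, you can overwrite the values in a-d (the answer uses lowercase column names) with empty strings. The following code does this: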

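    import pandas as pd

    # Explode the lists in "e"; reset_index() keeps the original row label in an "index" column.
    df_exploded = df.explode(column="e").reset_index()
    # Blank out a-d on rows whose original row label repeats the previous row's label.
    df_exploded.loc[df_exploded["index"] == df_exploded["index"].shift(), df.columns[:-1]] = ""
    df_exploded = df_exploded.drop(columns="index")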
    

    It results in the following table:

        a  b  c  d    e
    0   1  1  1  1   10
    1                11
    2                12
    3   1  1  2  1  NaN
    4   1  1  2  1    3
    5   1  2  2  1   40
    6                20
    7   2  1  1  1  NaN
    8   2  1  2  1   30
    9                12
    10               11
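
The snippet above assumes that df already exists. A minimal self-contained sketch of the same steps follows. The sample data is an assumption, reconstructed from the result table (integers stand in for the objects), and the names key_cols and is_repeat are illustrative. The cast to object is an extra step not in the original code, added because recent pandas versions may warn when a string is written into an integer column.

```python
import pandas as pd

# Assumed sample data: integer lists in "e" stand in for the objects.
df = pd.DataFrame(
    {
        "a": [1, 1, 1, 1, 2, 2],
        "b": [1, 1, 1, 2, 1, 1],
        "c": [1, 2, 2, 2, 1, 2],
        "d": [1, 1, 1, 1, 1, 1],
        "e": [[10, 11, 12], [], [3], [40, 20], [], [30, 12, 11]],
    }
)
key_cols = ["a", "b", "c", "d"]  # illustrative name for the columns to blank out

# explode() repeats the original row label for every list element;
# reset_index() moves that label into a column named "index".
df_exploded = df.explode("e").reset_index()

# A row is a repeat when its original label equals the previous row's label.
is_repeat = df_exploded["index"] == df_exploded["index"].shift()

# Cast to object so that writing "" does not silently change an integer dtype.
df_exploded = df_exploded.astype({col: object for col in key_cols})
df_exploded.loc[is_repeat, key_cols] = ""
df_exploded = df_exploded.drop(columns="index")

print(df_exploded)
```

Empty lists become a single row with NaN in e, which is why rows 3 and 7 keep their a-d values.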
    

    Note: if you want to keep the columns numeric, you can assign missing values instead of empty strings. To do so, adjust the following line:

    df_exploded.loc[df_exploded["index"] == df_exploded["index"].shift(), df.columns[:-1]] = pd.NA # or np.nan if you want to use numpy
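
As a sketch of how the missing-value variant helps with the counting problem from the question: the example below casts a-d to the nullable "Int64" dtype before assigning pd.NA, so the columns stay integer-typed. The cast, the sample data, and the names key_cols and is_repeat are assumptions for illustration and are not part of the original answer.

```python
import pandas as pd

# Assumed sample data: integer lists in "e" stand in for the objects.
df = pd.DataFrame(
    {
        "a": [1, 1, 1, 1, 2, 2],
        "b": [1, 1, 1, 2, 1, 1],
        "c": [1, 2, 2, 2, 1, 2],
        "d": [1, 1, 1, 1, 1, 1],
        "e": [[10, 11, 12], [], [3], [40, 20], [], [30, 12, 11]],
    }
)
key_cols = ["a", "b", "c", "d"]  # illustrative name

df_exploded = df.explode("e").reset_index()
is_repeat = df_exploded["index"] == df_exploded["index"].shift()

# Nullable "Int64" keeps the key columns integer-typed while allowing <NA>.
df_exploded = df_exploded.astype({col: "Int64" for col in key_cols})
df_exploded.loc[is_repeat, key_cols] = pd.NA
df_exploded = df_exploded.drop(columns="index")

# count() and value_counts() skip missing values by default,
# so each original row is counted only once.
print(df_exploded["a"].count())
rows_per_a = df_exploded["a"].value_counts()
print(rows_per_a)
```

With empty strings the repeated rows would still be counted as values, whereas missing values are ignored by default.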