I have a pandas DataFrame that looks like this. Column E contains a list:
A | B | C | D | E |
---|---|---|---|---|
1 | 1 | 1 | 1 | [obj1, obj2,obj3] |
1 | 1 | 2 | 1 | empty list |
1 | 1 | 2 | 1 | [obj1] |
1 | 2 | 2 | 1 | [obj1, obj2] |
2 | 1 | 1 | 1 | empty list |
2 | 1 | 2 | 1 | [obj1, obj2,obj3] |
I want to turn it into a MultiIndex (or any kind of DataFrame) that looks like this: one row for each value of the list, but without duplicating the values in A-D.
A | B | C | D | E |
---|---|---|---|---|
1 | 1 | 1 | 1 | obj1 |
  |   |   |   | obj2 |
  |   |   |   | obj3 |
1 | 1 | 2 | 1 | |
1 | 1 | 2 | 1 | obj1 |
1 | 2 | 2 | 1 | obj1 |
  |   |   |   | obj2 |
2 | 1 | 1 | 1 | |
2 | 1 | 2 | 1 | obj1 |
  |   |   |   | obj2 |
  |   |   |   | obj3 |
But I cannot figure out how to do it.
I tried using
df.explode(['E'], ignore_index=True)
This flattens the list, but it duplicates the values in columns A-D. So when I count the values, I also count the duplicates.
Any ideas?
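You can explode your data and reset the index, then compare the resulting index column with a shifted copy of itself. The comparison gives a mask that is true for every row that repeats the previous row's original index, so you can overwrite the values in a-d on exactly those rows. The following code does this: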
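import pandas as pd
# One row per list element; reset_index() keeps the original row label in an "index" column.
df_exploded = df.explode(column="e").reset_index()
# Blank a-d on every row whose original label matches the previous row's label.
df_exploded.loc[df_exploded["index"] == df_exploded["index"].shift(), df.columns[:-1]] = ""
df_exploded = df_exploded.drop(columns="index")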
It results in the following table:
    a  b  c  d    e
0   1  1  1  1   10
1                11
2                12
3   1  1  2  1  NaN
4   1  1  2  1    3
5   1  2  2  1   40
6                20
7   2  1  1  1  NaN
8   2  1  2  1   30
9                12
10               11
Note: if you want to keep missing values instead of empty strings in a-d (for example, so that the repeated rows are skipped when counting), you can assign NaN instead. To do so, adjust the masking line as follows:
df_exploded.loc[df_exploded["index"] == df_exploded["index"].shift(), df.columns[:-1]] = pd.NA # or np.nan if you want to use numpy
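For reference, here is a minimal self-contained sketch of the approach above. The sample data is an assumption reconstructed from the output table, and `explode_without_repeats` is a hypothetical helper name, not part of pandas. It also assumes the default unnamed index, so that `reset_index()` produces a column called "index". The cast to object is my addition: recent pandas versions warn when a string or NaN is written into an integer column.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the output table above (an assumption).
df = pd.DataFrame(
    {
        "a": [1, 1, 1, 1, 2, 2],
        "b": [1, 1, 1, 2, 1, 1],
        "c": [1, 2, 2, 2, 1, 2],
        "d": [1, 1, 1, 1, 1, 1],
        "e": [[10, 11, 12], [], [3], [40, 20], [], [30, 12, 11]],
    }
)


def explode_without_repeats(frame, list_col, fill=""):
    """Hypothetical helper: explode `list_col` and overwrite the repeated
    values in all other columns with `fill`."""
    other_cols = [c for c in frame.columns if c != list_col]
    # One row per list element; the original row label lands in "index".
    out = frame.explode(list_col).reset_index()
    # True where a row comes from the same original row as the one before it.
    repeated = out["index"] == out["index"].shift()
    # Object dtype lets the columns hold "" or NaN next to the integers.
    out = out.astype({c: object for c in other_cols})
    out.loc[repeated, other_cols] = fill
    return out.drop(columns="index")


blanked = explode_without_repeats(df, "e")            # "" in repeated rows
with_nan = explode_without_repeats(df, "e", fill=np.nan)
counts = with_nan[["a", "b", "c", "d"]].count()       # NaN rows are skipped
```

With the NaN variant, `count()` ignores the filled rows, so each of a to d is counted once per original row (6 here) rather than once per exploded row (11).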