I have a dataframe which has a structure like below. id, name, value, employee_0_salary, employee_0_skills_0_id, employee_0_skills_1_id, employee_1_salary, employee_1_skills_0_id so on..
Basically formed from a flattened json. But the number of columns exceed 1024. I couldn't copy the structure of json here but its a nested json with lists, arrays and dictionaries.
Now I want to split the dataframe so that the id column has duplicates but empoyee details comes as columns.
id salary skills_id
How to do it in pandas?
TIA
maybe melt() does what you want?
df = pd.DataFrame({
'id': [1, 2, 1],
'name': ['joe','sue', 'fred'],
'employee_0_salary': [30000, 35000, 40000],
'employee_0_skills_0_id': [101, 102, 103],
'employee_0_skills_1_id': [103, 104, 105],
'employee_1_salary': [32000, 36000, 37000],
'employee_1_skills_0_id': [105, 106, 107],
# Add other columns...
})
print(df)
id name employee_0_salary employee_0_skills_0_id \
0 1 joe 30000 101
1 2 sue 35000 102
2 1 fred 40000 103
employee_0_skills_1_id employee_1_salary employee_1_skills_0_id
0 103 32000 105
1 104 36000 106
2 105 37000 107
print(df.melt(id_vars=['id','name']))
id name variable value
0 1 joe employee_0_salary 30000
1 2 sue employee_0_salary 35000
2 1 fred employee_0_salary 40000
3 1 joe employee_0_skills_0_id 101
4 2 sue employee_0_skills_0_id 102
5 1 fred employee_0_skills_0_id 103
6 1 joe employee_0_skills_1_id 103
7 2 sue employee_0_skills_1_id 104
8 1 fred employee_0_skills_1_id 105
9 1 joe employee_1_salary 32000
10 2 sue employee_1_salary 36000
11 1 fred employee_1_salary 37000
12 1 joe employee_1_skills_0_id 105
13 2 sue employee_1_skills_0_id 106
14 1 fred employee_1_skills_0_id 107
then, to return to the original it is messy because of the duplicate in the id column. This is one way which adds a level to the row index so the pivot works, then drops it:
print(meltdf.reset_index().pivot(index=['index','id','name'], columns='variable', values='value') \
.droplevel(level=0) \
.groupby(level=0) \
.transform(lambda x: sorted(x, key=lambda k: pd.isna(k))) \
.dropna(axis=0))
variable employee_0_salary employee_0_skills_0_id employee_0_skills_1_id \
id name
1 joe 30000.0 101.0 103.0
2 sue 35000.0 102.0 104.0
1 fred 40000.0 103.0 105.0
variable employee_1_salary employee_1_skills_0_id
id name
1 joe 32000.0 105.0
2 sue 36000.0 106.0
1 fred 37000.0 107.0