I've already tried everything posted here but nothing is working, so please don't mark this as duplicate because I think the problem is different.
I have a json like this:
[{'Id': 1,
'Design': ["09",
'10',
'13'
],
'Research': ['Eng',
'Math']
}]
Plus other non-list colums. This is repeated for 500 ids.
I need to explode the list columns. the final output should be an excel file, I don't care if the explosion is done directly in json or in pandas.
Already tried:
def lenx(x):
return len(x) if isinstance(x,(list, tuple, np.ndarray, pd.Series)) else 1
def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
jcols = [j for j,v in enumerate(row.index) if v in cols]
if len(jcols)<1:
jcols = range(len(row.index))
Ls = [lenx(x) for x in row.values]
if not Ls[:-1]==Ls[1:]:
vals = [v if isinstance(v,list) else [v] for v in row.values]
if fill_mode=='external':
vals = [[e] + [fill_value]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
else e + [fill_value]*(max(Ls)-lenx(e))
for j,e in enumerate(vals)]
elif fill_mode == 'internal':
vals = [[e]+[e]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
else e+[e[-1]]*(max(Ls)-lenx(e))
for j,e in enumerate(vals)]
else:
vals = [e[0:min(Ls)] for e in vals]
row = pd.Series(vals,index=row.index.tolist())
return row
Leads to index error
df.explode(['B', 'C', 'D', 'E']).reset_index(drop=True)
Columns must have same lenght
df1 = pd.concat([df[x].explode().to_frame()
.assign(g=lambda x: x.groupby(level=0).cumcount())
.set_index('g', append=True)
for x in cols_to_explode], axis=1)
Somehow it creates lots of rows, I think it just explodes a column after another, and it leads to memory error.
Desired Output:
Id Design Research
1 09 Eng
1 10 Math
1 13
You can use json_normalize
and a deduplication-explode (as presented here):
tmp = pd.json_normalize(json)
def explode_dedup(s):
s = s.explode()
return s.set_axis(
pd.MultiIndex.from_arrays([s.index, s.groupby(level=0).cumcount()])
)
ids = ['Id']
cols = tmp.columns.difference(ids)
out = (tmp[ids]
.join(pd.concat({c: explode_dedup(tmp[c])
for c in cols}, axis=1)
.droplevel(-1)
)[tmp.columns]
)
NB. if you know the columns to explode, you can alternatively use:
cols = ['Design', 'Research']
ids = tmp.columns.difference(cols)
Output:
Id Design Research
0 1 09 Eng
0 1 10 Math
0 1 13 NaN