Search code examples
pythonpandas

Exploding multiple column list in pandas


I've already tried everything posted here but nothing is working, so please don't mark this as duplicate because I think the problem is different.

I have a json like this:

[{'Id': 1,
  'Design': ["09",
   '10',
   '13'
  ],
  'Research': ['Eng',
   'Math']
  }]

Plus other non-list colums. This is repeated for 500 ids.

I need to explode the list columns. the final output should be an excel file, I don't care if the explosion is done directly in json or in pandas.

Already tried:

    def lenx(x):
        return len(x) if isinstance(x,(list, tuple, np.ndarray, pd.Series)) else 1

    def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
        jcols = [j for j,v in enumerate(row.index) if v in cols]
        if len(jcols)<1:
            jcols = range(len(row.index))
        Ls = [lenx(x) for x in row.values]
        if not Ls[:-1]==Ls[1:]:
            vals = [v if isinstance(v,list) else [v] for v in row.values]
            if fill_mode=='external':
                vals = [[e] + [fill_value]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                        else e + [fill_value]*(max(Ls)-lenx(e))
                        for j,e in enumerate(vals)]
            elif fill_mode == 'internal':
                vals = [[e]+[e]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                        else e+[e[-1]]*(max(Ls)-lenx(e)) 
                        for j,e in enumerate(vals)]
            else:
                vals = [e[0:min(Ls)] for e in vals]
            row = pd.Series(vals,index=row.index.tolist())
        return row

Leads to index error

df.explode(['B', 'C', 'D', 'E']).reset_index(drop=True)

Columns must have same lenght

df1 = pd.concat([df[x].explode().to_frame()
                      .assign(g=lambda x: x.groupby(level=0).cumcount())
                      .set_index('g', append=True) 
                for x in cols_to_explode], axis=1)

Somehow it creates lots of rows, I think it just explodes a column after another, and it leads to memory error.

Desired Output:

Id      Design     Research
1         09          Eng
1         10          Math
1         13  

Solution

  • You can use json_normalize and a deduplication-explode (as presented here):

    tmp = pd.json_normalize(json)
    
    def explode_dedup(s):
        s = s.explode()
        return s.set_axis(
            pd.MultiIndex.from_arrays([s.index, s.groupby(level=0).cumcount()])
        )
    
    ids = ['Id']
    cols = tmp.columns.difference(ids)
    
    out = (tmp[ids]
           .join(pd.concat({c: explode_dedup(tmp[c])
                            for c in cols}, axis=1)
                   .droplevel(-1)
                )[tmp.columns]
          )
    

    NB. if you know the columns to explode, you can alternatively use:

    cols = ['Design', 'Research']
    ids = tmp.columns.difference(cols)
    

    Output:

       Id Design Research
    0   1     09      Eng
    0   1     10     Math
    0   1     13      NaN