Search code examples
pythonpandasdataframepandas-explode

How to unnest (explode) a column in a pandas DataFrame, into multiple rows


I have the following DataFrame where one of the columns is an object (list type cell):

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

Output:

   A       B
0  1  [1, 2]
1  2  [1, 2]

My expected output is:

   A  B
0  1  1
1  1  2
3  2  1
4  2  2

What should I do to achieve this?


Related question

Pandas column of lists, create a row for each list element

Good question and answer but only handle one column with list(In my answer the self-def function will work for multiple columns, also the accepted answer is use the most time consuming apply , which is not recommended, check more info When should I (not) want to use pandas apply() in my code?)


Solution

  • I know object dtype columns makes the data hard to convert with pandas functions. When I receive data like this, the first thing that came to mind was to "flatten" or unnest the columns.

    I am using pandas and Python functions for this type of question. If you are worried about the speed of the above solutions, check out user3483203's answer, since it's using numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.


    Method 0 [pandas >= 0.25] Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

    df.explode('B')
    
           A  B
        0  1  1
        1  1  2
        0  2  1
        1  2  2
    

    Given a dataframe with an empty list or a NaN in the column. An empty list will not cause an issue, but a NaN will need to be filled with a list

    df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
    df.B = df.B.fillna({i: [] for i in df.index})  # replace NaN with []
    df.explode('B')
    
       A    B
    0  1    1
    0  1    2
    1  2    1
    1  2    2
    2  3  NaN
    3  4  NaN
    

    Method 1 apply + pd.Series (easy to understand but in terms of performance not recommended . )

    df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
    Out[463]:
       A  B
    0  1  1
    1  1  2
    0  2  1
    1  2  2
    

    Method 2 Using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )

    df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
    df
    Out[465]:
       A  B
    0  1  1
    0  1  2
    1  2  1
    1  2  2
    

    Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .

    Solution : join or merge with the index after 'unnest' the single columns

    s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
    s.join(df.drop('B',1),how='left')
    Out[477]:
       B  A
    0  1  1
    0  2  1
    1  1  2
    1  2  2
    

    If you need the column order exactly the same as before, add reindex at the end.

    s.join(df.drop('B',1),how='left').reindex(columns=df.columns)
    

    Method 3 recreate the list

    pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
    Out[488]:
       A  B
    0  1  1
    1  1  2
    2  2  1
    3  2  2
    

    If more than two columns, use

    s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
    s.merge(df,left_on=0,right_index=True)
    Out[491]:
       0  1  A       B
    0  0  1  1  [1, 2]
    1  0  2  1  [1, 2]
    2  1  1  2  [1, 2]
    3  1  2  2  [1, 2]
    

    Method 4 using reindex or loc

    df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
    Out[554]:
       A  B
    0  1  1
    0  1  2
    1  2  1
    1  2  2
    
    #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
    

    Method 5 when the list only contains unique values:

    df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
    from collections import ChainMap
    d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
    pd.DataFrame(list(d.items()),columns=df.columns[::-1])
    Out[574]:
       B  A
    0  1  1
    1  2  1
    2  3  2
    3  4  2
    

    Method 6 using numpy for high performance:

    newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
    pd.DataFrame(data=newvalues[0],columns=df.columns)
       A  B
    0  1  1
    1  1  2
    2  2  1
    3  2  2
    

    Method 7 using base function itertools cycle and chain: Pure python solution just for fun

    from itertools import cycle,chain
    l=df.values.tolist()
    l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
    pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
       A  B
    0  1  1
    1  1  2
    2  2  1
    3  2  2
    

    Generalizing to multiple columns

    df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
    df
    Out[592]:
       A       B       C
    0  1  [1, 2]  [1, 2]
    1  2  [3, 4]  [3, 4]
    

    Self-def function:

    def unnesting(df, explode):
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
    
        return df1.join(df.drop(explode, 1), how='left')
    
    
    unnesting(df,['B','C'])
    Out[609]:
       B  C  A
    0  1  1  1
    0  2  2  1
    1  3  3  2
    1  4  4  2
    

    Column-wise Unnesting

    All above method is talking about the vertical unnesting and explode , If you do need expend the list horizontal, Check with pd.DataFrame constructor

    df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
    Out[33]:
       A       B       C  B_0  B_1
    0  1  [1, 2]  [1, 2]    1    2
    1  2  [3, 4]  [3, 4]    3    4
    

    Updated function

    def unnesting(df, explode, axis):
        if axis==1:
            idx = df.index.repeat(df[explode[0]].str.len())
            df1 = pd.concat([
                pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
            df1.index = idx
    
            return df1.join(df.drop(explode, 1), how='left')
        else :
            df1 = pd.concat([
                             pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
            return df1.join(df.drop(explode, 1), how='left')
    

    Test Output

    unnesting(df, ['B','C'], axis=0)
    Out[36]:
       B0  B1  C0  C1  A
    0   1   2   1   2  1
    1   3   4   3   4  2
    

    Update 2021-02-17 with original explode function

    def unnesting(df, explode, axis):
        if axis==1:
            df1 = pd.concat([df[x].explode() for x in explode], axis=1)
            return df1.join(df.drop(explode, 1), how='left')
        else :
            df1 = pd.concat([
                             pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
            return df1.join(df.drop(explode, 1), how='left')