Search code examples
pythonlistpandasunpack

Unpack the list element of DataFrame


I have this df:

l1 = ['a', 'b', 'c']
l2 = ['x', ['y1', 'y2', 'y3'], 'z']
df = pd.DataFrame(list(zip(l1, l2)), columns = ['l1', 'l2'])

result:

  l1            l2
0  a             x
1  b  [y1, y2, y3]
2  c             z

What i need is to unpack the inner list in l2 and spread the corresponding value in l1 like this:

  l1  l2
0  a   x
1  b  y1
2  b  y2
3  b  y3
4  c   z

What is the proper way to do this? Thanks.


Solution

  • I think you can use numpy.repeat for repeat values by legths by str.len and flat values of nested lists by chain:

    from  itertools import chain
    
    df1 = pd.DataFrame({
            "l1": np.repeat(df.l1.values, df.l2.str.len()),
            "l2": list(chain.from_iterable(df.l2))})
    print (df1)
      l1  l2
    0  a   x
    1  b  y1
    2  b  y2
    3  b  y3
    4  c   z
    

    Timings:

    #[100000 rows x 2 columns]
    np.random.seed(10)
    N = 100000
    l1 = ['a', 'b', 'c']
    l1 = np.random.choice(l1, N)
    l2 = [list(tuple(string.ascii_letters[:np.random.randint(1, 10)])) for _ in np.arange(N)]
    df = pd.DataFrame({"l1":l1, "l2":l2})
    df.l2 = df.l2.apply(lambda x: x if len(x) !=1 else x[0])
    #print (df)
    
    
    In [91]: %timeit (pd.DataFrame([(left, right) for outer in zip(l1, l2) for left, right in zip_longest(*outer, fillvalue=outer[0])]))
    1 loop, best of 3: 242 ms per loop
    
    In [92]: %timeit (pd.DataFrame({ "l1": np.repeat(df.l1.values, df.l2.str.len()), "l2": list(chain.from_iterable(df.l2))}))
    10 loops, best of 3: 84.6 ms per loop
    

    Conclusion:

    numpy.repeat is 3 times faster as zip_longest solution in larger df.

    EDIT:

    For compare with loop version is necessery smaller df, because very slow:

    #[1000 rows x 2 columns]
    np.random.seed(10)
    N = 1000
    l1 = ['a', 'b', 'c']
    l1 = np.random.choice(l1, N)
    l2 = [list(tuple(string.ascii_letters[:np.random.randint(1, 10)])) for _ in np.arange(N)]
    df = pd.DataFrame({"l1":l1, "l2":l2})
    df.l2 = df.l2.apply(lambda x: x if len(x) !=1 else x[0])
    #print (df)
    
    def alexey(df):
        df2 = pd.DataFrame(columns=df.columns,index=df.index)[0:0]
    
        for idx in df.index:
            new_row = df.loc[idx, :].copy()
            for res in df.ix[idx, 'l2']:
                new_row.set_value('l2', res)
                df2.loc[len(df2)] = new_row
        return df2
    
    print (alexey(df))
    
    In [20]: %timeit (alexey(df))
    1 loop, best of 3: 11.4 s per loop
    
    In [21]: %timeit pd.DataFrame([(left, right) for outer in zip(l1, l2) for left, right in zip_longest(*outer, fillvalue=outer[0])])
    100 loops, best of 3: 2.57 ms per loop
    
    In [22]: %timeit pd.DataFrame({ "l1": np.repeat(df.l1.values, df.l2.str.len()), "l2": list(chain.from_iterable(df.l2))})
    The slowest run took 4.42 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 1.41 ms per loop