Search code examples
pythondataframelistexpandmelt

How to expand multiple list column in rows in python


I have a DataFrame like this:

ID MS DS
654 1500,10000,20000,30000 60,365,730
131 1500,10000,20000 60,365,730
598 1500,10000,20000,30000 60,365,730

The desired output looks like this:

ID MS DS
654 1500 60
654 10000 365
654 20000 730
654 30000 Nan
131 1500 60
131 10000 365
131 20000 730
598 1500 60

I have tried the following code:

from itertools import zip_longest

df.apply(lambda row: list(zip_longest(row['MS'], 
                                  row['DS'], 
                                  fillvalue = 'Nan')),axis = 1)

This did not work as expected, as it creates a Series with list values like this: [('1500,10000,20000,30000', '60,365,730')], [('1500,10000,20000', '60,365,730')].


Solution

  • Here's one approach:

    cols = ['MS', 'DS']
    
    dfs = [df[c].explode().str.split(',', expand=True)
           .astype('Int64')
           .stack()
           .rename(c) 
           for c in cols]
    
    out = (pd.concat(dfs, axis=1)
           .droplevel(1)
           .join(df['ID'])
           .loc[:, df.columns]
           )
    
    out
    
        ID     MS    DS
    0  654   1500    60
    0  654  10000   365
    0  654  20000   730
    0  654  30000  <NA>
    1  131   1500    60
    1  131  10000   365
    1  131  20000   730
    2  598   1500    60
    2  598  10000   365
    2  598  20000   730
    2  598  30000  <NA>
    

    Explanation

    1. Inside a list comprehension, for each column of ['MS', 'DS']:
      • use Series.explode to get list elements into multiple rows;
      • use Series.str.split to split on ',' and set expand to True to get the result in multiple columns;
      • chain Series.astype with Int64 to turn what will be string values into proper integers, while allowing for NaN values (the shorter sequences will have these for columns not "reached");
      • apply df.stack to get a pd.Series result (with column values now as a second level index);
      • add Series.rename to restore the original column name.
    2. Join both Series, add non-exploded column, and re-order:
      • use pd.concat along axis=1 (columns)
      • apply df.droplevel to drop the added index level;
      • use df.join to re-add the non-exploded column ID;
      • pass df.columns to df.loc to reset the column order.

    Data used

    import pandas as pd
    
    data = {
        'ID': [654, 131, 598],
        'MS': [['1500,10000,20000,30000'], ['1500,10000,20000'], 
               ['1500,10000,20000,30000']],
        'DS': [['60,365,730'], ['60,365,730'], 
               ['60,365,730']]
    }
    
    df = pd.DataFrame(data)
    df
    
        ID                        MS            DS
    0  654  [1500,10000,20000,30000]  [60,365,730]
    1  131        [1500,10000,20000]  [60,365,730]
    2  598  [1500,10000,20000,30000]  [60,365,730]