Search code examples
pythonpandasdataframetransform

pd.wide_to_long in python is slow


I have a dataframe with 55049 rows and 667 columns in it.

Sample dataframe structure as follows:


data = {
    'g1': [1],
    'g2': [2],
    'g3': [3],
    
    'st1_1': [1],
    'st1_2': [1],
    'st1_3': [1],
    'st1_4': [1],
    'st1_5': [5],
    'st1_6': [5],
    'st1_7': [5],
    'st1_8': [5],
    
    'st1_Next_1': [8],
    'st1_Next_2': [8],
    'st1_Next_3': [8],
    'st1_Next_4': [8],
    'st1_Next_5': [9],
    'st1_Next_6': [9],
    'st1_Next_7': [9],
    'st1_Next_8': [9],
    
    'st2_1': [2],
    'st2_2': [2],
    'st2_3': [2],
    'st2_4': [2],
    'st2_5': [2],
    'st2_6': [2],
    'st2_7': [2],
    'st2_8': [2],
    
    'ft_1': [1],
    'ft_2': [0],
    'ft_3': [1],
    'ft_4': [1],
    'ft_5': [1],
    'ft_6': [0],
    'ft_7': [0],
    'ft_8': [1]
}

df = pd.DataFrame(data)
print(df)

To get my desired output I have the following code where I am using pd.wide_to_long

ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']

df_long = pd.wide_to_long(
      df.reset_index(),
      i=['index']+ilist ,
      stubnames= stublist,
      j='j', sep='_').reset_index()

df_long = df_long[df_long['ft']==1]

Above code is working in fine with expected results.

I perfromed this wide_to_long to apply the filter df_long[df_long['ft']==1]. which means ft_1 need to apply for all _1, ft_2 for all _2.....and so for all _8.

Problem is to perform wide_to_long operation it took around 2 mins, Since I have 800+ source files to process the whole process is taking 1600 mins which is quite high.

I Am looking for any alternative suggestions to transpose the data.

I have Tried this but didn't work for me with much differenece.

As @sammywemmy suggested, I have tried below code. But output is missing st1_Next.


ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']


df_pvot = df.pivot_longer(index=ilist,names_to=stublist,names_pattern=stublist)
print(df_pvot)

Output is missing st1_Next and data clubbing with st1 Instead of new column.

Output:

    g1  g2  g3  st1  st2   ft
0    1   2   3    1  2.0  1.0
1    1   2   3    1  2.0  0.0
2    1   2   3    1  2.0  1.0
3    1   2   3    1  2.0  1.0
4    1   2   3    5  2.0  1.0
5    1   2   3    5  2.0  0.0
6    1   2   3    5  2.0  0.0
7    1   2   3    5  2.0  1.0
8    1   2   3    8  NaN  NaN
9    1   2   3    8  NaN  NaN
10   1   2   3    8  NaN  NaN
11   1   2   3    8  NaN  NaN
12   1   2   3    9  NaN  NaN
13   1   2   3    9  NaN  NaN
14   1   2   3    9  NaN  NaN
15   1   2   3    9  NaN  NaN


Solution

  • One option is with pivot_longer, where you pass the new header names to names_to and a list of regexes to names_pattern:

    # pip install pyjanitor
    import pandas as pd
    
    df.pivot_longer(index=ilist,names_to=stublist,names_pattern=stublist)
       g1  g2  g3  st1  st2  ft
    0   1   2   3    1    2   1
    1   1   2   3    1    2   0
    2   1   2   3    1    2   1
    3   1   2   3    1    2   1
    4   1   2   3    1    2   1
    5   1   2   3    1    2   0
    6   1   2   3    1    2   0
    7   1   2   3    1    2   1
    

    Another option is a reshaping of the columns, followed by pd.stack:

    reshaped = df.set_index(ilist)
    reshaped.columns = reshaped.columns.str.split('_',expand=True).set_names([None,'drop'])
    reshaped.stack(level='drop').droplevel('drop').reset_index()
       g1  g2  g3  st1  st2  ft
    0   1   2   3    1    2   1
    1   1   2   3    1    2   0
    2   1   2   3    1    2   1
    3   1   2   3    1    2   1
    4   1   2   3    1    2   1
    5   1   2   3    1    2   0
    6   1   2   3    1    2   0
    7   1   2   3    1    2   1
    

    this is in response to the updated question - the names_pattern relies on regexes, and under the hood, pd.Series.str.contains and np.select is used to extract and pair the columns with the regexes. As such, the regexes have to be crafted properly to match the columns:

    # pip install pyjanitor
    import pandas as pd
    import janitor 
    
    # note the inclusion of digits within the regexes
    names_pattern = [r'st1_\d+',r'st1_Next',r'st2_\d+', 'ft']
    
    df.pivot_longer(index=ilist,names_to=stublist,names_pattern=names_pattern)
       g1  g2  g3  st1  st1_Next  st2  ft
    0   1   2   3    1         8    2   1
    1   1   2   3    1         8    2   0
    2   1   2   3    1         8    2   1
    3   1   2   3    1         8    2   1
    4   1   2   3    5         9    2   1
    5   1   2   3    5         9    2   0
    6   1   2   3    5         9    2   0
    7   1   2   3    5         9    2   1
    

    You can replicate this using stack, this time using pd.Series.str.rsplit(), with n=1.

    reshaped = df.set_index(ilist)
    reshaped.columns = (reshaped
                        .columns
                        .str
                        .rsplit('_',n=1,expand=True)
                        .set_names([None, 'drop'])
                       )
    
    reshaped.stack(level='drop').droplevel('drop').reset_index()
       g1  g2  g3  st1  st1_Next  st2  ft
    0   1   2   3    1         8    2   1
    1   1   2   3    1         8    2   0
    2   1   2   3    1         8    2   1
    3   1   2   3    1         8    2   1
    4   1   2   3    5         9    2   1
    5   1   2   3    5         9    2   0
    6   1   2   3    5         9    2   0
    7   1   2   3    5         9    2   1
    

    As long as you understand the pattern your columns have, and adapt your code to them, you should be fine.