I have a dataframe with 55049 rows and 667 columns in it.
Sample dataframe structure as follows:
data = {
'g1': [1],
'g2': [2],
'g3': [3],
'st1_1': [1],
'st1_2': [1],
'st1_3': [1],
'st1_4': [1],
'st1_5': [5],
'st1_6': [5],
'st1_7': [5],
'st1_8': [5],
'st1_Next_1': [8],
'st1_Next_2': [8],
'st1_Next_3': [8],
'st1_Next_4': [8],
'st1_Next_5': [9],
'st1_Next_6': [9],
'st1_Next_7': [9],
'st1_Next_8': [9],
'st2_1': [2],
'st2_2': [2],
'st2_3': [2],
'st2_4': [2],
'st2_5': [2],
'st2_6': [2],
'st2_7': [2],
'st2_8': [2],
'ft_1': [1],
'ft_2': [0],
'ft_3': [1],
'ft_4': [1],
'ft_5': [1],
'ft_6': [0],
'ft_7': [0],
'ft_8': [1]
}
df = pd.DataFrame(data)
print(df)
To get my desired output I have the following code where I am using pd.wide_to_long
ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']
df_long = pd.wide_to_long(
df.reset_index(),
i=['index']+ilist ,
stubnames= stublist,
j='j', sep='_').reset_index()
df_long = df_long[df_long['ft']==1]
Above code is working in fine with expected results.
I perfromed this wide_to_long to apply the filter
df_long[df_long['ft']==1]
. which means ft_1 need to apply for all _1, ft_2 for all _2.....and so for all _8.
Problem is to perform wide_to_long operation it took around 2 mins, Since I have 800+ source files to process the whole process is taking 1600 mins which is quite high.
I Am looking for any alternative suggestions to transpose the data.
I have Tried this but didn't work for me with much differenece.
As @sammywemmy suggested, I have tried below code. But output is missing
st1_Next
.
ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']
df_pvot = df.pivot_longer(index=ilist,names_to=stublist,names_pattern=stublist)
print(df_pvot)
Output is missing st1_Next and data clubbing with st1 Instead of new column.
Output:
g1 g2 g3 st1 st2 ft
0 1 2 3 1 2.0 1.0
1 1 2 3 1 2.0 0.0
2 1 2 3 1 2.0 1.0
3 1 2 3 1 2.0 1.0
4 1 2 3 5 2.0 1.0
5 1 2 3 5 2.0 0.0
6 1 2 3 5 2.0 0.0
7 1 2 3 5 2.0 1.0
8 1 2 3 8 NaN NaN
9 1 2 3 8 NaN NaN
10 1 2 3 8 NaN NaN
11 1 2 3 8 NaN NaN
12 1 2 3 9 NaN NaN
13 1 2 3 9 NaN NaN
14 1 2 3 9 NaN NaN
15 1 2 3 9 NaN NaN
One option is with pivot_longer, where you pass the new header names to names_to
and a list of regexes to names_pattern
:
# pip install pyjanitor
import pandas as pd
df.pivot_longer(index=ilist,names_to=stublist,names_pattern=stublist)
g1 g2 g3 st1 st2 ft
0 1 2 3 1 2 1
1 1 2 3 1 2 0
2 1 2 3 1 2 1
3 1 2 3 1 2 1
4 1 2 3 1 2 1
5 1 2 3 1 2 0
6 1 2 3 1 2 0
7 1 2 3 1 2 1
Another option is a reshaping of the columns, followed by pd.stack:
reshaped = df.set_index(ilist)
reshaped.columns = reshaped.columns.str.split('_',expand=True).set_names([None,'drop'])
reshaped.stack(level='drop').droplevel('drop').reset_index()
g1 g2 g3 st1 st2 ft
0 1 2 3 1 2 1
1 1 2 3 1 2 0
2 1 2 3 1 2 1
3 1 2 3 1 2 1
4 1 2 3 1 2 1
5 1 2 3 1 2 0
6 1 2 3 1 2 0
7 1 2 3 1 2 1
this is in response to the updated question - the names_pattern
relies on regexes, and under the hood, pd.Series.str.contains
and np.select
is used to extract and pair the columns with the regexes. As such, the regexes have to be crafted properly to match the columns:
# pip install pyjanitor
import pandas as pd
import janitor
# note the inclusion of digits within the regexes
names_pattern = [r'st1_\d+',r'st1_Next',r'st2_\d+', 'ft']
df.pivot_longer(index=ilist,names_to=stublist,names_pattern=names_pattern)
g1 g2 g3 st1 st1_Next st2 ft
0 1 2 3 1 8 2 1
1 1 2 3 1 8 2 0
2 1 2 3 1 8 2 1
3 1 2 3 1 8 2 1
4 1 2 3 5 9 2 1
5 1 2 3 5 9 2 0
6 1 2 3 5 9 2 0
7 1 2 3 5 9 2 1
You can replicate this using stack
, this time using pd.Series.str.rsplit()
, with n=1
.
reshaped = df.set_index(ilist)
reshaped.columns = (reshaped
.columns
.str
.rsplit('_',n=1,expand=True)
.set_names([None, 'drop'])
)
reshaped.stack(level='drop').droplevel('drop').reset_index()
g1 g2 g3 st1 st1_Next st2 ft
0 1 2 3 1 8 2 1
1 1 2 3 1 8 2 0
2 1 2 3 1 8 2 1
3 1 2 3 1 8 2 1
4 1 2 3 5 9 2 1
5 1 2 3 5 9 2 0
6 1 2 3 5 9 2 0
7 1 2 3 5 9 2 1
As long as you understand the pattern your columns have, and adapt your code to them, you should be fine.