I have 2 dataframes having the same columns and row numbers, some of these data frames columns are Dates(1 per dataframe), others floats and objects where strings are stored.
Let's take the below as a simplified example for df1(my actual dataframe has hundreds of columns):
Date Val0 Val1
0 2020-09-29 13:22:57 char1 5.34
1 2020-09-29 13:23:12 char2 4.5
2 2020-09-29 13:23:44 char3 Nan
3 2020-09-29 13:24:01 char4 24
val0
in that example is an object
df2
looks like:
Date Val0 Val1
0 2020-09-29 13:22:58 char6 Nan
1 2020-09-29 13:23:12 char4 89.77
2 2020-09-29 13:23:45 Nan
3 2020-09-29 13:24:01 char4 24
Expected
Date Date Val0 Val0 Val1 Val1 Val1
0 2020-09-29 13:22:57 2020-09-29 13:22:58 char1 char6 5.34 Nan 5.34
1 2020-09-29 13:23:12 2020-09-29 13:23:12 char2 char4 4.5 89.77 -85.27
2 2020-09-29 13:23:44 2020-09-29 13:23:45 char3 Nan Nan Nan
3 2020-09-29 13:24:01 2020-09-29 13:24:01 char4 char4 24 24 0
As we can see the floats got deducted and we had a new column having the difference while the objects having the same column are now 'next to one another'
I do realize that having the same column names might not be the approach, example what if we want to drop one or the .2
added by default by pandas, for me the column names at this stage don't matter they can be whatever as they're aligned as demonstrated and would still help me in my next step when writing to excel which is merging the row names only, example we'd have one
cell with 2 columns under it while it would also be 1 row name of val1
and 3 columns under it.
My code and approach:
I wanted to get rid of the stuff i can't deduct, in this case val0
, so i dropped that column from df1
and df2
in copied dataframes then deducted the outcome dfs giving me the last column for val1 and the deduction of dates just in case i wanted it. then concatenated df1
and df_2
looking like the below correspondant to df_6
:
df_minus = df_2 - df
df5 = pd.concat([df_2, df], axis=1)
df_6 = df5[list(sum(zip(df_2.columns, df.columns), ()))]
print(df_6)
Date Date Val0 Val0 Val1 Val1
0 2020-09-29 13:22:57 2020-09-29 13:22:58 char1 char6 5.34 Nan
1 2020-09-29 13:23:12 2020-09-29 13:23:12 char2 char4 4.5 89.77
2 2020-09-29 13:23:44 2020-09-29 13:23:45 char3 Nan Nan
3 2020-09-29 13:24:01 2020-09-29 13:24:01 char4 char4 24 24
My approach did solve the columns being objects issue, but then for the floats i though about using df_minus somehow but couldn't find a way.Some might believe i should simply take that column from df_minus
and add it in df_6
but let's keep in mind that my actual data has hundreds of column so that's not an option.
Here's an approach that should get you what I think you're trying to do.
We identify numeric columns for which the subtraction operation makes sense, then do the subtraction and use the results to update df_minus
, a copy of one of the input dataframes. We then rename the columns in df2
by appending a suffix '.2' and columns in df_minus
by appending either '.diff' for our subtraction results or '.delete' (marked for future deletion) for all other columns.
We then use your column sorting logic, but with 3 dataframes instead of just 2, so we have df1
, df2
and df_minus
in the results. Finally, we delete the unwanted columns that came from df_minus
.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([
{'Date':'2020-09-29 13:22:57','Val0':'char1','Val1':5.34},
{'Date':'2020-09-29 13:23:12','Val0':'char2','Val1':4.5},
{'Date':'2020-09-29 13:23:44','Val0':'char3','Val1':np.NaN},
{'Date':'2020-09-29 13:24:01','Val0':'char4','Val1':24}
])
df2 = pd.DataFrame([
{'Date':'2020-09-29 13:22:58','Val0':'char6','Val1':np.NaN},
{'Date':'2020-09-29 13:23:12','Val0':'char4','Val1':89.77},
{'Date':'2020-09-29 13:23:45','Val0':'', 'Val1':np.NaN},
{'Date':'2020-09-29 13:24:01','Val0':'char4','Val1':24}
])
print(f"df1:\n{df1}\n")
print(f"df2:\n{df2}\n")
dfNumericCols = set(c for c in df1.columns if isinstance(df1.loc[0,c], (float,int)))
df_minus = df1.copy(deep=True)
df_minus[list(dfNumericCols)] = df1[list(dfNumericCols)] - df2[list(dfNumericCols)]
df_minus.rename(columns={c:c+'.diff' if c in dfNumericCols else c+'.delete' for c in df_minus.columns}, inplace=True)
print(f"df_minus:\n{df_minus}\n")
df2.rename(columns={c:c+'.2' for c in df2.columns}, inplace=True)
df5 = pd.concat([df1, df2, df_minus], axis=1)
df6 = df5[list(sum(zip(df1.columns, df2.columns, df_minus.columns), ()))]
df6 = df6.drop(columns=[c for c in df6 if c[-len('.delete'):] == '.delete'], axis=1)
print(f"df6:\n{df6}\n")
Output:
df1:
Date Val0 Val1
0 2020-09-29 13:22:57 char1 5.34
1 2020-09-29 13:23:12 char2 4.50
2 2020-09-29 13:23:44 char3 NaN
3 2020-09-29 13:24:01 char4 24.00
df2:
Date Val0 Val1
0 2020-09-29 13:22:58 char6 NaN
1 2020-09-29 13:23:12 char4 89.77
2 2020-09-29 13:23:45 NaN
3 2020-09-29 13:24:01 char4 24.00
df_minus:
Date.delete Val0.delete Val1.diff
0 2020-09-29 13:22:57 char1 NaN
1 2020-09-29 13:23:12 char2 -85.27
2 2020-09-29 13:23:44 char3 NaN
3 2020-09-29 13:24:01 char4 0.00
df6:
Date Date.2 Val0 Val0.2 Val1 Val1.2 Val1.diff
0 2020-09-29 13:22:57 2020-09-29 13:22:58 char1 char6 5.34 NaN NaN
1 2020-09-29 13:23:12 2020-09-29 13:23:12 char2 char4 4.50 89.77 -85.27
2 2020-09-29 13:23:44 2020-09-29 13:23:45 char3 NaN NaN NaN
3 2020-09-29 13:24:01 2020-09-29 13:24:01 char4 char4 24.00 24.00 0.00