Search code examples
pythonpandasdataframefinance

Horizontally compare two dataframes and inset blank rows for differences


Horizontally compare two dataframes and inset blank rows for differences. I need to see blank rows on the second dataframe where there are no id values

df1 =

user    id  Amount
John    1   1000
Tom     2   2000
Tom     2   3000
Claire  3   4000
Mary    4   5000
 
df2 = 

user    id  Amount
John    1   1000
Tom     2   2000
Claire  3   4000
Mary    4   5000

To Horizontally compare I am using

pd.concat([df1, df2], 
                    axis=1,


Result=

user    id  Amount      user    id  Amount
John    1   1000        John    1   1000
Tom     2   2000        Tom     2   2000
Tom     2   3000        Claire  3   4000
Claire  3   4000        Mary    4   5000
Mary    4   5000                

What I expect to happen:

user    id  Amount      user    id  Amount
John    1   1000        John    1   1000
Tom     2   2000        Tom     2   2000
Tom     2   3000                
Claire  3   4000        Claire  3   4000
Mary    4   5000        Mary    4   5000



Solution

  • This worked for me.

    First I created a data frame to find the duplicated IDs

    duplicate = df1[df1['id'].duplicated()] 
    

    Then I looked up this new dataframe to determine what IDs I needed to insert an empty line after

    df2= df2.assign(result=df2['id'].isin(duplicate['id']).astype(int))
    

    I created a result column to identify the rows where I needed to insert the empty line

    a = (df2['result'] == 1)
    df3 = df2.copy() #make a copy because we want to be safe here
    for i in df2.loc[a].index:
        empty_row = pd.DataFrame([], index=[i]) #creating the empty data
        j = i + 1 #just to get things easier to read
        df3 = pd.concat([df3.ix[:i], empty_row, df3.ix[j:]], sort=False) #slicing the df
    
    df3 = df3.reset_index(drop=True,) #reset the index
    

    Once the empty rows are inserted on df3 then I used pd.concat to display both dfs side by side

    df_all =pd.concat([df1, df3], axis=1, sort=False)