
Pandas merge rows if blank


I am looking to combine rows based on an id as long as the merge is not overwriting a value.

So for a DataFrame:

      Column1 Column2 Column3 Column4
aa_1      123             456
aa_2              123
aa_4                      123
aa_6
aa_1              789             789
aa_2     4567
aa_4      234     456     678     987
aa_6      123     456     567     765

I am looking for the end result to be:

,Column1,Column2,Column3,Column4
aa_1,123,789,456,789
aa_2,4567,123,,
aa_4,234,456,678,987
aa_4,,,123,
aa_6,123,456,567,765

Code for recreating dataframe:

import pandas as pd

data = [['123', '', '456', ''], ['', '123', '', ''], ['', '', '123', ''], ['', '', '', ''], ['', '789', '', '789'], ['4567', '', '', ''], ['234', '456', '678', '987'], ['123', '456', '567', '765']]
index = ['aa_1', 'aa_2', 'aa_4', 'aa_6', 'aa_1', 'aa_2', 'aa_4', 'aa_6']
columns = ['Column1', 'Column2', 'Column3', 'Column4']

df = pd.DataFrame(data, index, columns)


Solution

  • It may not be the most elegant solution, but it does the job:

    The idea is to first set aside the rows that are already completely filled; in your example these are the last two rows. They have to be kept apart because they would always clash with any other row that shares their index (aa_4 and aa_6). The next step is to merge the rows that contain empty cells. This is easier to handle if:

    1. The index is unique (hence the reset_index in the solution)
    2. The empty cells are NaN (hence the replace statement)

    So:

    import pandas as pd
    import numpy as np
    
    data = [['123', '', '456', ''], ['', '123', '', ''], ['', '', '123', ''], ['', '', '', ''], ['', '789', '', '789'], ['4567', '', '', ''], ['234', '456', '678', '987'], ['123', '456', '567', '765']]
    index = ['aa_1', 'aa_2', 'aa_4', 'aa_6', 'aa_1', 'aa_2', 'aa_4', 'aa_6']
    columns = ['Column1', 'Column2', 'Column3', 'Column4']
    
    df = pd.DataFrame(data, index, columns)
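    df = df.reset_index()  # move the ids into an 'index' column so the row index is unique

    # Turn blank cells into NaN so that dropna() can pick out the complete rows
    df2 = df.replace(r'^\s*$', np.nan, regex=True)
    df3 = df2.dropna()  # rows without any blank cell
    df4 = df[~df.index.isin(df3.index)]  # remaining rows, still holding '' for blanks
    # sum() concatenates strings, so '123' + '' gives '123'
    df5 = df4.groupby(['index']).sum().reset_index()
    df_full = pd.concat([df5, df3]).set_index(['index'])  # restore the original index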
    

    which returns

          Column1 Column2 Column3 Column4
    index                                
    aa_1      123     789     456     789
    aa_2     4567     123                
    aa_4                      123        
    aa_6                                 
    aa_4      234     456     678     987
    aa_6      123     456     567     765
    

    All that remains is to remove the empty row (the merged aa_6), using the same replace as in the solution followed by dropna:

    df_full = df_full.replace(r'^\s*$', np.nan, regex=True)
    df_full = df_full.dropna(how='all')
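
For convenience, the steps above can be put together in one function. This is only a minimal sketch: the name merge_blank_rows is made up for illustration, and the final fillna('') is my assumption about how you want the blanks to look in the end. It also relies on sum() concatenating strings, so if two incomplete rows with the same id both have a value in the same column, the two values end up glued together; nothing here guards against that.

```python
import numpy as np
import pandas as pd


def merge_blank_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Merge incomplete rows that share an index label; keep complete rows as they are.

    Hypothetical helper, assumes every cell is a string and blanks are ''.
    """
    key = df.index.name or "index"  # name of the column created by reset_index()
    flat = df.reset_index()         # unique row index, ids become a regular column

    # Blank cells -> NaN so that dropna() can pick out the complete rows
    with_nan = flat.replace(r"^\s*$", np.nan, regex=True)
    complete = with_nan.dropna()
    partial = flat[~flat.index.isin(complete.index)]

    # sum() concatenates strings, so '123' + '' gives '123'
    merged = partial.groupby(key).sum().reset_index()

    out = pd.concat([merged, complete]).set_index(key)
    # Drop rows that are still completely blank after merging
    out = out.replace(r"^\s*$", np.nan, regex=True).dropna(how="all")
    return out.fillna("")  # show the remaining blanks as '' again


data = [['123', '', '456', ''], ['', '123', '', ''], ['', '', '123', ''], ['', '', '', ''],
        ['', '789', '', '789'], ['4567', '', '', ''], ['234', '456', '678', '987'],
        ['123', '456', '567', '765']]
index = ['aa_1', 'aa_2', 'aa_4', 'aa_6', 'aa_1', 'aa_2', 'aa_4', 'aa_6']
columns = ['Column1', 'Column2', 'Column3', 'Column4']

df = pd.DataFrame(data, index, columns)
result = merge_blank_rows(df)
print(result)
```

The result has the same rows as the output you asked for, but the two aa_4 rows come out in the opposite order (the partial one first), because the merged rows are concatenated before the complete ones.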