I am looking to combine rows that share an id, as long as the merge does not overwrite a value.
So for a df:
      Column1  Column2  Column3  Column4
aa_1      123               456
aa_2               123
aa_4                        123
aa_6
aa_1               789               789
aa_2     4567
aa_4      234      456      678      987
aa_6      123      456      567      765
I am looking for the end result to be:
,Column1,Column2,Column3,Column4
aa_1,123,789,456,789
aa_2,4567,123,,
aa_4,234,456,678,987
aa_4,,,123,
aa_6,123,456,567,765
Code for recreating dataframe:
import pandas as pd
data = [['123', '', '456', ''], ['', '123', '', ''], ['', '', '123', ''], ['', '', '', ''], ['', '789', '', '789'], ['4567', '', '', ''], ['234', '456', '678', '987'], ['123', '456', '567', '765']]
index = ['aa_1', 'aa_2', 'aa_4', 'aa_6', 'aa_1', 'aa_2', 'aa_4', 'aa_6']
columns = ['Column1', 'Column2', 'Column3', 'Column4']
df = pd.DataFrame(data, index, columns)
It may not be the most elegant of all possible solutions, but this does the job:
The idea is to first set aside the rows that are already completely filled. In your case these are the last two rows of your example. The reason is that they will always conflict with any other row that has the same index (aa_4 and aa_6). The next step is to merge the rows that contain empty cells. However, finding the filled rows is easier if the empty strings are first replaced by nan (hence the replace statement). So:
import pandas as pd
import numpy as np
data = [['123', '', '456', ''], ['', '123', '', ''], ['', '', '123', ''], ['', '', '', ''], ['', '789', '', '789'], ['4567', '', '', ''], ['234', '456', '678', '987'], ['123', '456', '567', '765']]
index = ['aa_1', 'aa_2', 'aa_4', 'aa_6', 'aa_1', 'aa_2', 'aa_4', 'aa_6']
columns = ['Column1', 'Column2', 'Column3', 'Column4']
df = pd.DataFrame(data, index, columns)
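df = df.reset_index()  # move the ids into a regular column named 'index'
df2 = df.replace(r'^\s*$', np.nan, regex=True)  # empty strings -> NaN
df3 = df2.dropna()  # rows with every column filled; set these aside
df4 = df[~df.index.isin(df3.index)]  # remaining rows, still with '' in the empty cells
df5 = df4.groupby(['index']).sum().reset_index()  # summing strings concatenates them, so '' + '789' gives '789'
df_full = pd.concat([df5, df3]).set_index(['index'])  # restores your original index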
which returns
       Column1  Column2  Column3  Column4
index
aa_1       123      789      456      789
aa_2      4567      123
aa_4                         123
aa_6
aa_4       234      456      678      987
aa_6       123      456      567      765
The only thing left is to remove the empty row (the first aa_6), e.g.:
df_full = df_full.replace(r'^\s*$', np.nan, regex=True)
df_full = df_full.dropna(how='all')
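If you want to reuse this, the steps can be wrapped in one function. This is only a sketch under some assumptions of mine: the name merge_rows_without_overwrite is made up for illustration, it uses groupby(...).first() instead of sum(), and it raises a ValueError when two partially filled rows of the same id have a value in the same column (with sum() those strings would silently be concatenated). It also turns NaN back into empty strings at the end.

```python
import numpy as np
import pandas as pd


def merge_rows_without_overwrite(df):
    """Hypothetical helper: merge rows that share an index label
    without overwriting any value."""
    # Treat empty or whitespace-only strings as missing values.
    clean = df.replace(r'^\s*$', np.nan, regex=True)

    # Rows with every column filled are set aside; they are never merged.
    full_mask = clean.notna().all(axis=1).to_numpy()
    full = clean.loc[full_mask]
    partial = clean.loc[~full_mask]

    # Assumption: refuse to merge if two partial rows of the same id
    # both have a value in the same column.
    counts = partial.groupby(level=0).count()
    if (counts > 1).any().any():
        raise ValueError("partially filled rows overlap; merging would overwrite a value")

    # first() picks the first non-missing value per column within each id.
    merged = partial.groupby(level=0).first()

    # Drop ids whose partial rows were completely empty (aa_6 in the example).
    merged = merged.dropna(how='all')

    # Put the fully filled rows back and restore '' for the empty cells.
    return pd.concat([merged, full]).fillna('')


data = [['123', '', '456', ''], ['', '123', '', ''], ['', '', '123', ''],
        ['', '', '', ''], ['', '789', '', '789'], ['4567', '', '', ''],
        ['234', '456', '678', '987'], ['123', '456', '567', '765']]
index = ['aa_1', 'aa_2', 'aa_4', 'aa_6', 'aa_1', 'aa_2', 'aa_4', 'aa_6']
columns = ['Column1', 'Column2', 'Column3', 'Column4']
df = pd.DataFrame(data, index, columns)

result = merge_rows_without_overwrite(df)
print(result)
```

The rows come out in a different order than in your expected result (the partially filled aa_4 row comes before the fully filled one), so sort afterwards if the order matters to you.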