I have multiple data frames, each representing monthly progress. My mission is to join them with two condition step by step. So here is three sample data frame, and I will try to explain what I want to achieve.
import pandas as pd
data_22_3 = {
'id_number': ['A123', 'B456', 'C789'],
'company': ['Insurance1', 'Insurance2', 'Insurance3'],
'type': ['A', 'A', 'C'],
'Income': [100, 200, 300]
}
df_22_3 = pd.DataFrame(data_22_3)
data_22_4 = {
'id_number': ['A123', 'B456', 'D012'],
'company': ['Insurance1', 'Insurance2', 'Insurance1'],
'type': ['A', 'B', 'B'],
'Income': [150, 250, 400]
}
df_22_4 = pd.DataFrame(data_22_4)
data_22_5 = {
'id_number': ['A123', 'C789', 'E034'],
'company': ['Insurance1', 'Insurance3', 'Insurance5'],
'type': ['A', 'C', 'B'],
'Income': [180, 320, 500]
}
df_22_5 = pd.DataFrame(data_22_5)
So, let's take the first data frame as the main data frame. The joining first and second data frame should be as following:
If id_number
of second data frame exists on the first one, then the new column of Income_2
should be added to the first data frame as the next month's income. Since the next data frame has the same columns, all of the columns except Income
should be disregarded.
If id_number
of second data frame does not exist on the first one, then the whole row should be added to the first data frame. The only thing to consider is to place Income
value to the Income_2
column, and putting 0 to the Income
column, since the value belongs to the next month.
The resulting data frame then should be joined with the next dataframe in a similar method and so on.
Even if there is a difference on values of other columns such as Type
or Company
, as long as id_number
is the same, the former data frame value should be taken.
I might be explaining inadequately, but the result is something like this:
data_all = {
'id_number': ['A123', 'B456', 'C789', 'D012', 'E034'],
'company': ['Insurance1', 'Insurance2', 'Insurance3', 'Insurance1', 'Insurance5'],
'type': ['A', 'A', 'C', 'B', 'B'],
'Income': [100, 200, 300, 0, 0],
'Income_2': [150, 250, 0, 400, 0],
'Income_3': [180, 0, 320, 0, 500]
}
all_df = pd.DataFrame(data_all)
Instead of simultaneously, concat first (without Income
) and drop duplicates, then merge each Income
.
dfs = [df_22_3, df_22_4, df_22_5]
result = (
pd.concat(dfs)
.drop(columns='Income')
.drop_duplicates('id_number')
.reset_index(drop=True)
)
For the merge, manually reindexing is the cleanest way I found to fill missing values:
for i, df in enumerate(dfs, start=1):
result[f'Income_{i}'] = (
df
.set_index('id_number')
['Income']
.reindex(result['id_number'], fill_value=0)
.reset_index(drop=True)
)
At this point I have the result's Income
column labelled as Income_1
for consistency, but it can be renamed:
result.rename(columns={'Income_1': 'Income'})
id_number company type Income Income_2 Income_3
0 A123 Insurance1 A 100 150 180
1 B456 Insurance2 A 200 250 0
2 C789 Insurance3 C 300 0 320
3 D012 Insurance1 B 0 400 0
4 E034 Insurance5 B 0 0 500