Search code examples
pythonpandasdataframemergeconcatenation

Merging and concatinating simultaneously


I have multiple data frames, each representing monthly progress. My mission is to join them with two condition step by step. So here is three sample data frame, and I will try to explain what I want to achieve.

import pandas as pd

data_22_3 = {
    'id_number': ['A123', 'B456', 'C789'],
    'company': ['Insurance1', 'Insurance2', 'Insurance3'],
    'type': ['A', 'A', 'C'],
    'Income': [100, 200, 300]
}
df_22_3 = pd.DataFrame(data_22_3)

data_22_4 = {
    'id_number': ['A123', 'B456', 'D012'],
    'company': ['Insurance1', 'Insurance2', 'Insurance1'],
    'type': ['A', 'B', 'B'],
    'Income': [150, 250, 400]
}
df_22_4 = pd.DataFrame(data_22_4)

data_22_5 = {
    'id_number': ['A123', 'C789', 'E034'],
    'company': ['Insurance1', 'Insurance3', 'Insurance5'],
    'type': ['A', 'C', 'B'],
    'Income': [180, 320, 500]
}
df_22_5 = pd.DataFrame(data_22_5)

So, let's take the first data frame as the main data frame. The joining first and second data frame should be as following:

If id_number of second data frame exists on the first one, then the new column of Income_2 should be added to the first data frame as the next month's income. Since the next data frame has the same columns, all of the columns except Income should be disregarded.

If id_number of second data frame does not exist on the first one, then the whole row should be added to the first data frame. The only thing to consider is to place Income value to the Income_2 column, and putting 0 to the Income column, since the value belongs to the next month.

The resulting data frame then should be joined with the next dataframe in a similar method and so on.

Even if there is a difference on values of other columns such as Type or Company, as long as id_number is the same, the former data frame value should be taken.

I might be explaining inadequately, but the result is something like this:

data_all = {
    'id_number': ['A123', 'B456', 'C789', 'D012', 'E034'],
    'company': ['Insurance1', 'Insurance2', 'Insurance3', 'Insurance1', 'Insurance5'],
    'type': ['A', 'A', 'C', 'B', 'B'],
    'Income': [100, 200, 300, 0, 0],
    'Income_2': [150, 250, 0, 400, 0],
    'Income_3': [180, 0, 320, 0, 500]
}

all_df = pd.DataFrame(data_all)

Solution

  • Instead of simultaneously, concat first (without Income) and drop duplicates, then merge each Income.

    dfs = [df_22_3, df_22_4, df_22_5]
    result = (
        pd.concat(dfs)
        .drop(columns='Income')
        .drop_duplicates('id_number')
        .reset_index(drop=True)
    )
    

    For the merge, manually reindexing is the cleanest way I found to fill missing values:

    for i, df in enumerate(dfs, start=1):
        result[f'Income_{i}'] = (
            df
            .set_index('id_number')
            ['Income']
            .reindex(result['id_number'], fill_value=0)
            .reset_index(drop=True)
        )
    

    At this point I have the result's Income column labelled as Income_1 for consistency, but it can be renamed:

    result.rename(columns={'Income_1': 'Income'})
    
      id_number     company type  Income  Income_2  Income_3
    0      A123  Insurance1    A     100       150       180
    1      B456  Insurance2    A     200       250         0
    2      C789  Insurance3    C     300         0       320
    3      D012  Insurance1    B       0       400         0
    4      E034  Insurance5    B       0         0       500