Search code examples
pythonpandasdataframejoinmerge

join/merge multiple pandas dataframes with blending of values on one column


I have two pandas dataframes width unique column names except:

  • year
  • student_id
  • student_name

I would like to merge on these 3 columns (year, student_id, student_name), however sometimes the student_name is misspelled (or has a different spelling). Essentially I would like to merge on year & column, while preserving the student_name column.

wrt varying student_name, I am indifferent as to which student_name is chosen (although it would be nice if it chose the more frequently occurring student_name, but I don't want to ask for too much). I would prefer the merged/final dataframe to use only ONE version of the student_name (per student_id), but am willing to settle otherwise :)

Example:

>>> df_A
   year  student_id    student_name  exam_A
0  2023       12345  Chris P. Bacon      80
1  2024       12345     Chris Bacon      90
2  2024       33333      Noah Buddy      90
3  2021       55555  Faye Kipperson      99
4  2024       11111     Beau Gusman      75

>>> df_B
   year  student_id    student_name  exam_B  exam_C
0  2024       12345  Chris P. Bacon      90      75
1  2024       33333      Noah Buddy      88      77
2  2020       88888    Saul Goodman      86      88
3  2023       88888    Saul Goodman      99      79
4  2024       55555   Fay Kipperson      82      75
5  2024       11111     Beau Gusman      80      99

What I want is this =>

   year  student_id    student_name  exam_A  exam_B  exam_C
0  2020       88888    Saul Goodman     NaN    86.0    88.0
1  2021       55555  Faye Kipperson    99.0     NaN     NaN
2  2023       12345  Chris P. Bacon    80.0     NaN     NaN
3  2023       88888    Saul Goodman     NaN    99.0    79.0
4  2024       11111     Beau Gusman    75.0    80.0    99.0
5  2024       12345     Chris Bacon    90.0    90.0    75.0
6  2024       33333      Noah Buddy    90.0    88.0    77.0
7  2024       55555  Faye Kipperson     NaN    82.0    75.0

I DON'T WANT multiple student_name columns. Currently, I am merging one dataframe at a time & then going back through and filling in the null student_name cells.

Question #2: I actually have about 33 dataframes to merge (all with unique column names except year, student_id, student_name), is there a way for me to merge them in this same fashion all at once (as opposed to merging each one in individually)?

Thank you very much!!!

I have tried:

>>> pd.merge(df_A, df_B, on=[ 'year', 'student_id', ], how='outer')
   year  student_id  student_name_x  exam_A  student_name_y  exam_B  exam_C
0  2020       88888             NaN     NaN    Saul Goodman    86.0    88.0
1  2021       55555  Faye Kipperson    99.0             NaN     NaN     NaN
2  2023       12345  Chris P. Bacon    80.0             NaN     NaN     NaN
3  2023       88888             NaN     NaN    Saul Goodman    99.0    79.0
4  2024       11111     Beau Gusman    75.0     Beau Gusman    80.0    99.0
5  2024       12345     Chris Bacon    90.0  Chris P. Bacon    90.0    75.0
6  2024       33333      Noah Buddy    90.0      Noah Buddy    88.0    77.0
7  2024       55555             NaN     NaN   Fay Kipperson    82.0    75.0


>>> pd.merge(df_A, df_B.drop(columns=[ 'student_name' ]), on=[ 'year', 'student_id', ], how='outer')
   year  student_id    student_name  exam_A  exam_B  exam_C
0  2020       88888             NaN     NaN    86.0    88.0
1  2021       55555  Faye Kipperson    99.0     NaN     NaN
2  2023       12345  Chris P. Bacon    80.0     NaN     NaN
3  2023       88888             NaN     NaN    99.0    79.0
4  2024       11111     Beau Gusman    75.0    80.0    99.0
5  2024       12345     Chris Bacon    90.0    90.0    75.0
6  2024       33333      Noah Buddy    90.0    88.0    77.0
7  2024       55555             NaN     NaN    82.0    75.0

Solution

    1. To get the most common name, concatenate all the student ids and names, group on 'student_id' and aggregate with mode.
    2. Do an outer merge of left and right dataframes excluding column 'student_name'.
    3. To streamline this, you can create a function that takes a list of dataframes, and returns the resulting dataframe.

    As an example, including also a df_C:

       year  student_id    student_name  exam_X  exam_Z
    0  2024       12345  Chris P. Bacon      90      75
    1  2024       33333      Noah Buddy      88      77
    2  2020       88888    Saul Goodman      86      88
    3  2023       88888    Saul Goodman      99      79
    4  2024       55555   Fay Kipperson      82      75
    5  2024       11111     Beau Gusman      80      99
    

    You can use something like this:

    def merge_grades(dfs: list[pd.DataFrame]):
        df_students = (
            pd.concat([df[["student_id", "student_name"]] for df in dfs])
            .groupby("student_id", as_index=False)
            .agg(lambda x: x.mode()[0])
        )
    
        df_left = dfs[0].drop(columns="student_name")
        for df_right in dfs[1:]:
            df_left = pd.merge(
                df_left,
                df_right.drop(columns="student_name"),
                how="outer",
            )
    
        return df_left.merge(df_students)
    
    
    df = merge_grades([df_A, df_B, df_C])
    
       year  student_id  exam_A  exam_B  exam_C  exam_X  exam_Z    student_name
    0  2020       88888     NaN    86.0    88.0    86.0    88.0    Saul Goodman
    1  2021       55555    99.0     NaN     NaN     NaN     NaN   Fay Kipperson
    2  2023       12345    80.0     NaN     NaN     NaN     NaN  Chris P. Bacon
    3  2023       88888     NaN    99.0    79.0    99.0    79.0    Saul Goodman
    4  2024       11111    75.0    80.0    99.0    80.0    99.0     Beau Gusman
    5  2024       12345    90.0    90.0    75.0    90.0    75.0  Chris P. Bacon
    6  2024       33333    90.0    88.0    77.0    88.0    77.0      Noah Buddy
    7  2024       55555     NaN    82.0    75.0    82.0    75.0   Fay Kipperson