join/merge multiple pandas dataframes with blending of values on one column

I have two pandas dataframes width unique column names except:

year
student_id
student_name

I would like to merge on these 3 columns (year, student_id, student_name), however sometimes the student_name is misspelled (or has a different spelling). Essentially I would like to merge on year & column, while preserving the student_name column.

wrt varying student_name, I am indifferent as to which student_name is chosen (although it would be nice if it chose the more frequently occurring student_name, but I don't want to ask for too much). I would prefer the merged/final dataframe to use only ONE version of the student_name (per student_id), but am willing to settle otherwise :)

Example:

>>> df_A
   year  student_id    student_name  exam_A
0  2023       12345  Chris P. Bacon      80
1  2024       12345     Chris Bacon      90
2  2024       33333      Noah Buddy      90
3  2021       55555  Faye Kipperson      99
4  2024       11111     Beau Gusman      75

>>> df_B
   year  student_id    student_name  exam_B  exam_C
0  2024       12345  Chris P. Bacon      90      75
1  2024       33333      Noah Buddy      88      77
2  2020       88888    Saul Goodman      86      88
3  2023       88888    Saul Goodman      99      79
4  2024       55555   Fay Kipperson      82      75
5  2024       11111     Beau Gusman      80      99

What I want is this =>

   year  student_id    student_name  exam_A  exam_B  exam_C
0  2020       88888    Saul Goodman     NaN    86.0    88.0
1  2021       55555  Faye Kipperson    99.0     NaN     NaN
2  2023       12345  Chris P. Bacon    80.0     NaN     NaN
3  2023       88888    Saul Goodman     NaN    99.0    79.0
4  2024       11111     Beau Gusman    75.0    80.0    99.0
5  2024       12345     Chris Bacon    90.0    90.0    75.0
6  2024       33333      Noah Buddy    90.0    88.0    77.0
7  2024       55555  Faye Kipperson     NaN    82.0    75.0

I DON'T WANT multiple student_name columns. Currently, I am merging one dataframe at a time & then going back through and filling in the null student_name cells.

Question #2: I actually have about 33 dataframes to merge (all with unique column names except year, student_id, student_name), is there a way for me to merge them in this same fashion all at once (as opposed to merging each one in individually)?

Thank you very much!!!

I have tried:

>>> pd.merge(df_A, df_B, on=[ 'year', 'student_id', ], how='outer')
   year  student_id  student_name_x  exam_A  student_name_y  exam_B  exam_C
0  2020       88888             NaN     NaN    Saul Goodman    86.0    88.0
1  2021       55555  Faye Kipperson    99.0             NaN     NaN     NaN
2  2023       12345  Chris P. Bacon    80.0             NaN     NaN     NaN
3  2023       88888             NaN     NaN    Saul Goodman    99.0    79.0
4  2024       11111     Beau Gusman    75.0     Beau Gusman    80.0    99.0
5  2024       12345     Chris Bacon    90.0  Chris P. Bacon    90.0    75.0
6  2024       33333      Noah Buddy    90.0      Noah Buddy    88.0    77.0
7  2024       55555             NaN     NaN   Fay Kipperson    82.0    75.0


>>> pd.merge(df_A, df_B.drop(columns=[ 'student_name' ]), on=[ 'year', 'student_id', ], how='outer')
   year  student_id    student_name  exam_A  exam_B  exam_C
0  2020       88888             NaN     NaN    86.0    88.0
1  2021       55555  Faye Kipperson    99.0     NaN     NaN
2  2023       12345  Chris P. Bacon    80.0     NaN     NaN
3  2023       88888             NaN     NaN    99.0    79.0
4  2024       11111     Beau Gusman    75.0    80.0    99.0
5  2024       12345     Chris Bacon    90.0    90.0    75.0
6  2024       33333      Noah Buddy    90.0    88.0    77.0
7  2024       55555             NaN     NaN    82.0    75.0

Solution

To get the most common name, concatenate all the student ids and names, group on 'student_id' and aggregate with mode.
Do an outer merge of left and right dataframes excluding column 'student_name'.
To streamline this, you can create a function that takes a list of dataframes, and returns the resulting dataframe.

As an example, including also a df_C:

   year  student_id    student_name  exam_X  exam_Z
0  2024       12345  Chris P. Bacon      90      75
1  2024       33333      Noah Buddy      88      77
2  2020       88888    Saul Goodman      86      88
3  2023       88888    Saul Goodman      99      79
4  2024       55555   Fay Kipperson      82      75
5  2024       11111     Beau Gusman      80      99

You can use something like this:

def merge_grades(dfs: list[pd.DataFrame]):
    df_students = (
        pd.concat([df[["student_id", "student_name"]] for df in dfs])
        .groupby("student_id", as_index=False)
        .agg(lambda x: x.mode()[0])
    )

    df_left = dfs[0].drop(columns="student_name")
    for df_right in dfs[1:]:
        df_left = pd.merge(
            df_left,
            df_right.drop(columns="student_name"),
            how="outer",
        )

    return df_left.merge(df_students)


df = merge_grades([df_A, df_B, df_C])

   year  student_id  exam_A  exam_B  exam_C  exam_X  exam_Z    student_name
0  2020       88888     NaN    86.0    88.0    86.0    88.0    Saul Goodman
1  2021       55555    99.0     NaN     NaN     NaN     NaN   Fay Kipperson
2  2023       12345    80.0     NaN     NaN     NaN     NaN  Chris P. Bacon
3  2023       88888     NaN    99.0    79.0    99.0    79.0    Saul Goodman
4  2024       11111    75.0    80.0    99.0    80.0    99.0     Beau Gusman
5  2024       12345    90.0    90.0    75.0    90.0    75.0  Chris P. Bacon
6  2024       33333    90.0    88.0    77.0    88.0    77.0      Noah Buddy
7  2024       55555     NaN    82.0    75.0    82.0    75.0   Fay Kipperson