I have two pandas dataframes width unique column names except:
I would like to merge on these 3 columns (year, student_id, student_name), however sometimes the student_name is misspelled (or has a different spelling). Essentially I would like to merge on year & column, while preserving the student_name column.
wrt varying student_name, I am indifferent as to which student_name is chosen (although it would be nice if it chose the more frequently occurring student_name, but I don't want to ask for too much). I would prefer the merged/final dataframe to use only ONE version of the student_name (per student_id), but am willing to settle otherwise :)
Example:
>>> df_A
year student_id student_name exam_A
0 2023 12345 Chris P. Bacon 80
1 2024 12345 Chris Bacon 90
2 2024 33333 Noah Buddy 90
3 2021 55555 Faye Kipperson 99
4 2024 11111 Beau Gusman 75
>>> df_B
year student_id student_name exam_B exam_C
0 2024 12345 Chris P. Bacon 90 75
1 2024 33333 Noah Buddy 88 77
2 2020 88888 Saul Goodman 86 88
3 2023 88888 Saul Goodman 99 79
4 2024 55555 Fay Kipperson 82 75
5 2024 11111 Beau Gusman 80 99
What I want is this =>
year student_id student_name exam_A exam_B exam_C
0 2020 88888 Saul Goodman NaN 86.0 88.0
1 2021 55555 Faye Kipperson 99.0 NaN NaN
2 2023 12345 Chris P. Bacon 80.0 NaN NaN
3 2023 88888 Saul Goodman NaN 99.0 79.0
4 2024 11111 Beau Gusman 75.0 80.0 99.0
5 2024 12345 Chris Bacon 90.0 90.0 75.0
6 2024 33333 Noah Buddy 90.0 88.0 77.0
7 2024 55555 Faye Kipperson NaN 82.0 75.0
I DON'T WANT multiple student_name columns. Currently, I am merging one dataframe at a time & then going back through and filling in the null student_name cells.
Question #2: I actually have about 33 dataframes to merge (all with unique column names except year, student_id, student_name), is there a way for me to merge them in this same fashion all at once (as opposed to merging each one in individually)?
Thank you very much!!!
I have tried:
>>> pd.merge(df_A, df_B, on=[ 'year', 'student_id', ], how='outer')
year student_id student_name_x exam_A student_name_y exam_B exam_C
0 2020 88888 NaN NaN Saul Goodman 86.0 88.0
1 2021 55555 Faye Kipperson 99.0 NaN NaN NaN
2 2023 12345 Chris P. Bacon 80.0 NaN NaN NaN
3 2023 88888 NaN NaN Saul Goodman 99.0 79.0
4 2024 11111 Beau Gusman 75.0 Beau Gusman 80.0 99.0
5 2024 12345 Chris Bacon 90.0 Chris P. Bacon 90.0 75.0
6 2024 33333 Noah Buddy 90.0 Noah Buddy 88.0 77.0
7 2024 55555 NaN NaN Fay Kipperson 82.0 75.0
>>> pd.merge(df_A, df_B.drop(columns=[ 'student_name' ]), on=[ 'year', 'student_id', ], how='outer')
year student_id student_name exam_A exam_B exam_C
0 2020 88888 NaN NaN 86.0 88.0
1 2021 55555 Faye Kipperson 99.0 NaN NaN
2 2023 12345 Chris P. Bacon 80.0 NaN NaN
3 2023 88888 NaN NaN 99.0 79.0
4 2024 11111 Beau Gusman 75.0 80.0 99.0
5 2024 12345 Chris Bacon 90.0 90.0 75.0
6 2024 33333 Noah Buddy 90.0 88.0 77.0
7 2024 55555 NaN NaN 82.0 75.0
'student_id'
and aggregate with mode
.'student_name'
.As an example, including also a df_C
:
year student_id student_name exam_X exam_Z
0 2024 12345 Chris P. Bacon 90 75
1 2024 33333 Noah Buddy 88 77
2 2020 88888 Saul Goodman 86 88
3 2023 88888 Saul Goodman 99 79
4 2024 55555 Fay Kipperson 82 75
5 2024 11111 Beau Gusman 80 99
You can use something like this:
def merge_grades(dfs: list[pd.DataFrame]):
df_students = (
pd.concat([df[["student_id", "student_name"]] for df in dfs])
.groupby("student_id", as_index=False)
.agg(lambda x: x.mode()[0])
)
df_left = dfs[0].drop(columns="student_name")
for df_right in dfs[1:]:
df_left = pd.merge(
df_left,
df_right.drop(columns="student_name"),
how="outer",
)
return df_left.merge(df_students)
df = merge_grades([df_A, df_B, df_C])
year student_id exam_A exam_B exam_C exam_X exam_Z student_name
0 2020 88888 NaN 86.0 88.0 86.0 88.0 Saul Goodman
1 2021 55555 99.0 NaN NaN NaN NaN Fay Kipperson
2 2023 12345 80.0 NaN NaN NaN NaN Chris P. Bacon
3 2023 88888 NaN 99.0 79.0 99.0 79.0 Saul Goodman
4 2024 11111 75.0 80.0 99.0 80.0 99.0 Beau Gusman
5 2024 12345 90.0 90.0 75.0 90.0 75.0 Chris P. Bacon
6 2024 33333 90.0 88.0 77.0 88.0 77.0 Noah Buddy
7 2024 55555 NaN 82.0 75.0 82.0 75.0 Fay Kipperson