I have two pandas dataframes looking like this
ID | name | |
---|---|---|
1 | "firstname.lastname@provider.com" | "firstname lastname" |
... | ... | ... |
5150 | "firstname.lastname@provider.com" | "firstname lastname" |
with roughly 5150 rows. The data is not cleaned in the sense that it may happen that the name
column contains names with typos, additional spaces, written in camelcase. Also it may be that the name is an empty string.
The second dataframe contains information like this
Id | Name | To | To_Name | |
---|---|---|---|---|
1 | "firstname.lastname@provider.com" | "firstname lastname" | "firstname.lastname@provider.com" | "firstname lastname" |
... | ... | ... | ||
8500 | "firstname.lastname@provider.com" | "firstname lastname" | "firstname.lastname@provider.com" | "firstname lastname" |
with roughly 8500 rows. Here, the name
columns have the same issue as the first dataframe.
I now want to create a new dataframe out of the previous two dataframes in the sense of relational databases, i.e. in the form
ID | From | To |
---|---|---|
1 | 1 | 2 |
2 | 4 | 8 |
where the ID
column refers to the ID
column of the second dataframe and the values in the From
and To
column refer to the first dataframe where we map names onto integers.
The code below runs but takes roughly one minute. Do you guys have ideas how I could speed it up?
Id_new = []
From_new = []
To_new = []
for i in range(0,len(second_df['Id'])):
Id_new.append(second_df['Id'].iloc[i])
email = second_df['Email'].iloc[i]
name = second_df['Name'].iloc[i]
testdf = first_Df.where(first_Df['Email'] == email).dropna()
value = int(testdf.loc[testdf['Name'] == name].iloc[0].at["ID"])
From_new.append(value)
emailto = second_df['To'].iloc[i]
nameto = second_df['To_Name'].iloc[i]
testdf = first_Df.where(first_Df['Email'] == emailto).dropna()
valueto = int(testdf.loc[testdf['Name'] == nameto].iloc[0].at["ID"])
To_new.append(valueto)
return output_df = pd.DataFrame(list(zip(Id_new, From_new, To_new)),
columns = ['ID', 'From', 'To'])
You should try to avoid running for loops when dealing with pandas DataFrames, most of the times there is a better way of doing it. In this case you probably want to use merge
Merge, join, concatenate and compare
You can merge first on email
and name
and then on to
and to_name
, something like this:
df1 = pd.DataFrame(
{"ID": ["1", "2", "3"], "email": ["a", "b", "c"], "name": ["x", "y", "z"]}
)
df2 = pd.DataFrame(
{
"Id": ["1", "2", "3", "4"],
"email": ["a", "b", "c", "d"],
"name": ["x", "y", "z", "k"],
"to": ["m", "a", "b", "p"],
"to_name": ["r", "x", "y", "u"],
}
)
new_df = (
df2.merge(df1[["ID", "email", "name"]], on=["email", "name"], how="left")
.rename(columns={"ID": "From"})
.merge(df1, right_on=["email", "name"], left_on=["to", "to_name"], how="left")
.rename(columns={"ID": "To"})[["Id", "From", "To"]]
)