I'm working on a very large dataset (4M rows) to transform CSV files into SQL. I need to create joins between several dataframes, using the index of one table as the foreign key into another table. I've found a solution, but it is quite slow (1800 rows/s, see below), and I'm looking for a faster (more performant) way. In pseudo-code, what I want to do is:
replace df1.street
by df2.index
where df1.street == df2.street
and df1.number == df2.number

I'm open to any solution, including multiprocessing (tried it, but no luck), as long as it makes things faster while using as little memory as possible. I've also tried df.merge() and other functions, but no luck so far.

Thanks!
import pandas as pd
df1 = pd.DataFrame({
'street': {'qr16ef3677886a44f8b9c5bc37dd660688a': 'quai de la Tournelle', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 'r. AlexandreDumas', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 'pass. Brunoy', 'qr1fb0760cd0fab4c71a4297d006ec3d119': 'Montmartre', 'qr167fce4c23d5c4b879ca6423cea15e742': 'Martel'}
, 'number': {'qr16ef3677886a44f8b9c5bc37dd660688a': '33', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': '99', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': '18', 'qr1fb0760cd0fab4c71a4297d006ec3d119': '123', 'qr167fce4c23d5c4b879ca6423cea15e742': '4'}
, 'date': {'qr16ef3677886a44f8b9c5bc37dd660688a': 1914, 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 1900, 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 1921, 'qr1fb0760cd0fab4c71a4297d006ec3d119': 1858, 'qr167fce4c23d5c4b879ca6423cea15e742': 1896}
})
                                                   street  number  date
qr16ef3677886a44f8b9c5bc37dd660688a  quai de la Tournelle      33  1914
qr112e28085e3c84c41b6b6a5e13ecf15ac     r. AlexandreDumas      99  1900
qr1a213d2a5cbf64718892b3dbb3a9024f1          pass. Brunoy      18  1921
qr1fb0760cd0fab4c71a4297d006ec3d119            Montmartre     123  1858
qr167fce4c23d5c4b879ca6423cea15e742                Martel       4  1896
df2 = pd.DataFrame({
'number': {'qr152f8de48daa64cf098f44fb3d9e7e145': '123', 'qr18ae0099b6afb48a78d466e5ed6871bec': '18', 'qr183daee61fb98489ebd05556968027a0d': '18', 'qr1e0ee6ec37dbd4e799905db721592ba48': '33', 'qr148505eca183c4fb38f844c35130b92f0': '4'}
, 'street': {'qr152f8de48daa64cf098f44fb3d9e7e145': 'Montmartre', 'qr18ae0099b6afb48a78d466e5ed6871bec': 'Montmartre', 'qr183daee61fb98489ebd05556968027a0d': 'pass. Brunoy', 'qr1e0ee6ec37dbd4e799905db721592ba48': 'quai de la Tournelle', 'qr148505eca183c4fb38f844c35130b92f0': 'Martel'}
, 'date': {'qr152f8de48daa64cf098f44fb3d9e7e145': ['1858', '1858'], 'qr18ae0099b6afb48a78d466e5ed6871bec': ['1876', '1881'], 'qr183daee61fb98489ebd05556968027a0d': ['1921', '1921'], 'qr1e0ee6ec37dbd4e799905db721592ba48': ['1914', '1914'], 'qr148505eca183c4fb38f844c35130b92f0': ['1896', '1896']}
})
                                     number                street          date
qr152f8de48daa64cf098f44fb3d9e7e145     123            Montmartre  [1858, 1858]
qr18ae0099b6afb48a78d466e5ed6871bec      18            Montmartre  [1876, 1881]
qr183daee61fb98489ebd05556968027a0d      18          pass. Brunoy  [1921, 1921]
qr1e0ee6ec37dbd4e799905db721592ba48      33  quai de la Tournelle  [1914, 1914]
qr148505eca183c4fb38f844c35130b92f0       4                Martel  [1896, 1896]
My solution relies on a function (foreignkey()) called inside df.apply(). However, df.apply() is known to be slower than vectorized methods, and my solution comes dangerously close to a double loop over df1 and df2:
def foreignkey(ro: pd.Series) -> pd.Series:
    """
    Replace the address in `ro` (a row of `df1`) by a foreign key
    pointing to `df2`. The key is written to `ro.street`.
    """
    ro.street = df2.loc[
        (df2.street == ro.street)    # address has the same full street name
        & (df2.number == ro.number)  # address has the same street number
    ].index[0]                       # IndexError if there is no match
    return ro

# progress_apply() is tqdm's drop-in for apply() (requires tqdm.pandas())
df1 = df1.progress_apply(foreignkey, axis=1)
What you described is a classic merge operation (it's called JOIN in the SQL world). Try this to see if it speeds up your code:
tmp = (
    df1.reset_index(names="df1_id")  # keep df1's index as a regular column
    .merge(
        # names= in reset_index needs pandas >= 1.5
        df2[["street", "number"]].reset_index(names="df2_id"),
        how="left",                  # keep df1 rows that have no match in df2
        on=["street", "number"],
    )
    .drop_duplicates("df1_id")       # keep only the first match per df1 row
    .set_index("df1_id")
)
df1["street_id"] = tmp["df2_id"]     # aligns on df1's original index
The drop_duplicates is there to keep only the first match when one row in df1 matches multiple rows in df2.
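If the merge's intermediate frame ever becomes a memory concern at 4M rows, an equivalent approach is to build the (street, number) -> key lookup once and reindex df1 against it. A minimal sketch, assuming the same df1/df2 layout as above:

# one Series mapping each (street, number) pair to its df2 key
lookup = pd.Series(
    df2.index,
    index=pd.MultiIndex.from_frame(df2[["street", "number"]]),
)
# keep the first key per pair, mirroring the drop_duplicates above
lookup = lookup[~lookup.index.duplicated(keep="first")]

# look up every df1 (street, number) pair; unmatched rows get NaN
keys = pd.MultiIndex.from_frame(df1[["street", "number"]])
df1["street_id"] = lookup.reindex(keys).to_numpy()

Like the merge, this replaces the per-row scan of df2 with a single vectorized lookup, which is where the speedup over apply() comes from.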