
[pandas]: most performant way to create foreign keys from one table to another?


I'm working on a very large dataset (4M rows), transforming CSV files to SQL. I need to create joins between several dataframes, using the index of one table as a foreign key in another table. The solution I've found is quite slow (1800 rows/s, see below), and I'm looking for a faster (more performant) way.

What's the fastest way to create joins between the two dataframes below? (Replace df1.street by df2.index wherever df1.street == df2.street and df1.number == df2.number.)

I'm open to any solution, including multiprocessing (tried it, but no luck), as long as it makes things faster while using as little memory as possible. I've also tried df.merge() and other functions, but no luck so far.

Thanks!!!


Dataframes

df1

import pandas as pd

df1 = pd.DataFrame({
    'street': {'qr16ef3677886a44f8b9c5bc37dd660688a': 'quai de la Tournelle', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 'r. AlexandreDumas', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 'pass. Brunoy', 'qr1fb0760cd0fab4c71a4297d006ec3d119': 'Montmartre', 'qr167fce4c23d5c4b879ca6423cea15e742': 'Martel'}
    , 'number': {'qr16ef3677886a44f8b9c5bc37dd660688a': '33', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': '99', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': '18', 'qr1fb0760cd0fab4c71a4297d006ec3d119': '123', 'qr167fce4c23d5c4b879ca6423cea15e742': '4'}
    , 'date': {'qr16ef3677886a44f8b9c5bc37dd660688a': 1914, 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 1900, 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 1921, 'qr1fb0760cd0fab4c71a4297d006ec3d119': 1858, 'qr167fce4c23d5c4b879ca6423cea15e742': 1896}
})
                                                    street number  date
qr16ef3677886a44f8b9c5bc37dd660688a  quai de la Tournelle     33  1914
qr112e28085e3c84c41b6b6a5e13ecf15ac     r. AlexandreDumas     99  1900
qr1a213d2a5cbf64718892b3dbb3a9024f1          pass. Brunoy     18  1921
qr1fb0760cd0fab4c71a4297d006ec3d119            Montmartre    123  1858
qr167fce4c23d5c4b879ca6423cea15e742                Martel      4  1896 

df2

df2 = pd.DataFrame({
    'number': {'qr152f8de48daa64cf098f44fb3d9e7e145': '123', 'qr18ae0099b6afb48a78d466e5ed6871bec': '18', 'qr183daee61fb98489ebd05556968027a0d': '18', 'qr1e0ee6ec37dbd4e799905db721592ba48': '33', 'qr148505eca183c4fb38f844c35130b92f0': '4'}
    , 'street': {'qr152f8de48daa64cf098f44fb3d9e7e145': 'Montmartre', 'qr18ae0099b6afb48a78d466e5ed6871bec': 'Montmartre', 'qr183daee61fb98489ebd05556968027a0d': 'pass. Brunoy', 'qr1e0ee6ec37dbd4e799905db721592ba48': 'quai de la Tournelle', 'qr148505eca183c4fb38f844c35130b92f0': 'Martel'}
    , 'date': {'qr152f8de48daa64cf098f44fb3d9e7e145': ['1858', '1858'], 'qr18ae0099b6afb48a78d466e5ed6871bec': ['1876', '1881'], 'qr183daee61fb98489ebd05556968027a0d': ['1921', '1921'], 'qr1e0ee6ec37dbd4e799905db721592ba48': ['1914', '1914'], 'qr148505eca183c4fb38f844c35130b92f0': ['1896', '1896']}
}) 

                                     number                street          date
qr152f8de48daa64cf098f44fb3d9e7e145    123            Montmartre  [1858, 1858]
qr18ae0099b6afb48a78d466e5ed6871bec     18            Montmartre  [1876, 1881]
qr183daee61fb98489ebd05556968027a0d     18          pass. Brunoy  [1921, 1921]
qr1e0ee6ec37dbd4e799905db721592ba48     33  quai de la Tournelle  [1914, 1914]
qr148505eca183c4fb38f844c35130b92f0      4                Martel  [1896, 1896]

My solution

My solution relies on a function (foreignkey()) called inside df.apply(). However, df.apply() is known to be slower than vectorized operations, and this comes dangerously close to a double loop: for each row of df1, the whole of df2 is scanned.


def foreignkey(ro: pd.Series) -> pd.Series:
    """
    replace the street name in row `ro` of `df1` by a foreign key
    pointing to `df2` (the index label of the matching row).
    raises IndexError when no matching row exists in `df2`.
    """
    ro.street = df2.loc[
        ( df2.street == ro.street )    # address has the same street full name
        & ( df2.number == ro.number )  # address has the same street number
    ].index[0]
    return ro

df1 = df1.progress_apply(foreignkey, axis=1)  # tqdm's apply; requires tqdm.pandas() first
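For reference, here is a minimal, self-contained version of this approach, using plain `apply` instead of tqdm's `progress_apply`. The two-row frames and their labels (`a1`, `a2`, `b1`, `b2`) are made up for brevity; note that a df1 row with no counterpart in df2 would make `.index[0]` raise an IndexError.

```python
import pandas as pd

# made-up, shortened versions of df1 / df2 (labels are hypothetical)
df1 = pd.DataFrame({
    'street': {'a1': 'quai de la Tournelle', 'a2': 'Martel'},
    'number': {'a1': '33', 'a2': '4'},
    'date':   {'a1': 1914, 'a2': 1896},
})
df2 = pd.DataFrame({
    'street': {'b1': 'quai de la Tournelle', 'b2': 'Martel'},
    'number': {'b1': '33', 'b2': '4'},
})

def foreignkey(ro: pd.Series) -> pd.Series:
    # scan df2 for the matching (street, number) pair;
    # .index[0] raises IndexError when there is no match
    ro.street = df2.loc[
        (df2.street == ro.street) & (df2.number == ro.number)
    ].index[0]
    return ro

out = df1.apply(foreignkey, axis=1)
print(out.street.tolist())  # ['b1', 'b2']
```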

Solution

  • What you described is a classic merge operation (it's called JOIN in the SQL world).

    Try this to see if it speeds up your code:

    tmp = (
        df1.reset_index(names="df1_id")
        .merge(
            df2[["street", "number"]].reset_index(names="df2_id"),
            how="left",
            on=["street", "number"],
        )
        .drop_duplicates("df1_id")
        .set_index("df1_id")
    )
    df1["street_id"] = tmp["df2_id"]
    

    The drop_duplicates call keeps only the first match when one row in df1 matches several rows in df2. (Note that reset_index(names=...) requires pandas ≥ 1.5; on older versions, reset the index first and rename the resulting column.)
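As an alternative sketch (not part of the answer above): since the join key is exact equality on (street, number), a plain Python dict gives a hash-based lookup in O(len(df1) + len(df2)), with the same first-match-wins semantics as merge + drop_duplicates. The short row labels (`k1`…`k3`, `f1`, `f2`) are made up for readability:

```python
import pandas as pd

# made-up, shortened frames; row labels are hypothetical
df1 = pd.DataFrame({
    'street': ['quai de la Tournelle', 'r. AlexandreDumas', 'Montmartre'],
    'number': ['33', '99', '123'],
}, index=['k1', 'k2', 'k3'])
df2 = pd.DataFrame({
    'street': ['Montmartre', 'quai de la Tournelle'],
    'number': ['123', '33'],
}, index=['f1', 'f2'])

# build a (street, number) -> df2-row-label dict once;
# setdefault keeps the first label per key, like drop_duplicates above
lookup = {}
for label, street, number in zip(df2.index, df2['street'], df2['number']):
    lookup.setdefault((street, number), label)

# one pass over df1 instead of scanning df2 for every row;
# unmatched addresses get None ('r. AlexandreDumas' has no match here)
df1['street_id'] = [lookup.get(key) for key in zip(df1['street'], df1['number'])]
print(df1['street_id'].tolist())  # ['f2', None, 'f1']
```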