Search code examples
pythonpandasconditional-statementsapply

How to merge two dataframes on an interval?


I have two dataframes, mp and sp.

mp contains state route (int), ~1/10 milepost (float) marks, latitude (float), longitude (float).

            route  milepost         lon         lat
2091        2      0.0              -122.189880 47.979063
2667        2      0.1              -122.188067 47.979760
2159        2      0.2              -122.186210 47.979560
1174        2      0.3              -122.184197 47.979098
1878        2      0.4              -122.182098 47.978718

sp has state route (int), ~ milepost (float) where the speed limit begins and ends, and speed limit (int).

      route    begin       end          limit
3119  2        0.000000    0.130000     30
2787  2        0.130000    2.880000     55
3184  2        2.880000    12.080000    60
3035  2        12.080000   12.720000    55
2780  2        12.720000   14.220000    45
2777  2        14.220000   15.450000    35
2774  2        15.450000   21.500000    55

I want to have one dataframe (sp + speed limit), but cannot join on the state route and milepost. How can I apply the speed limit to each milepost in mp based on the interval? That is, for state route 2, milepost 0, 0.1, speed limit = 30; milepost 0.2, 0.3, ..., 2.8, speed limit = 55; etc.

            route  milepost         lon         lat       limit
2091        2      0.0              -122.189880 47.979063 30
2667        2      0.1              -122.188067 47.979760 55
2159        2      0.2              -122.186210 47.979560 55
...
2789        2      2.8              -122.134982 47.972241 60
1388        2      2.9              -122.133974 47.970995 60
1609        2      3.0              -122.132964 47.969750 60

I have tried pandas merge, but lost many rows. Tried to fuzzy join using fuzzypanda, but get an error that fuzzypanda.matching doesn't exist (the one method it should have!), so cannot create fuzzy column to merge on. The following didn't work out either:

for route in mp['route']:
    for milepost in mp['milepost']:
        if (route == sp['route']) & (milepost >= sp['begin']) & (milepost <= sp['end']):
            mp.insert(1, 'limit', sp['limit'][0])

Any approach to the problem is welcome; no need to go the way I was trying! Thank you!


Solution

  • I have a first solution to start over, but with a few weaknesses:

    • not 100% Pythonic since it still analyzes your mp DataFrame row by row with a for loop, based on the "reference" table provided by the sp DataFrame.
    • perhaps not "bulletproof" robust as it is relies on the assumption that there is exactly one matching "conditional" row of sp for each ("state route", "milepost") couple of the mp DataFrame.

    Sample data

    I added two lines to your mp DataFrame such that mp and sp do not share the same length.

    import pandas as pd
    
    mp_data = {
        "state route": [2, 2, 2, 2, 2, 2, 2],
        "milepost": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "latitude": [
            -122.18988,
            -122.18807,
            -122.18621,
            -122.1842,
            -122.1821,
            -122.18,
            -122.175,
        ],
        "longitude": [47.97906, 47.97976, 47.97956, 47.9791, 47.97872, 47.978, 47.9775],
    }
    df_mp = pd.DataFrame(mp_data)
    
    sp_data = {
        "state route": [2, 2, 2, 2, 2],
        "start speed limit milepost": [0, 0.13, 2.88, 12.08, 12.72],
        "stop speed limit milepost": [0.13, 2.88, 12.08, 12.72, 14.22],
        "speed limit": [30, 55, 60, 55, 45],
    }
    df_sp = pd.DataFrame(sp_data)
    
    

    checking what it looks like

    print(df_mp)
    
       state route  milepost   latitude  longitude
    0            2       0.0 -122.18988   47.97906
    1            2       0.1 -122.18807   47.97976
    2            2       0.2 -122.18621   47.97956
    3            2       0.3 -122.18420   47.97910
    4            2       0.4 -122.18210   47.97872
    5            2       0.5 -122.18000   47.97800
    6            2       0.6 -122.17500   47.97750
    
    print(df_sp)
    
       state route  start speed limit milepost  stop speed limit milepost  \
    0            2                        0.00                       0.13   
    1            2                        0.13                       2.88   
    2            2                        2.88                      12.08   
    3            2                       12.08                      12.72   
    4            2                       12.72                      14.22   
    
       speed limit  
    0           30  
    1           55  
    2           60  
    3           55  
    4           45 
    

    Adding speed limitations to df_mp DataFrame

    list_sl = [] 
    # comment: preparing a list that will be filled row by row
    # is also an unsatisfactory method that could be improved  
    for _, row in df_mp.iterrows():
        list_sl.append(
            (
                df_sp["speed limit"].loc[
                    (row["state route"] == df_sp["state route"])
                    & (row["milepost"] >= df_sp["start speed limit milepost"])
                    & (row["milepost"] < df_sp["stop speed limit milepost"])
                ]
            ).iloc[0] # works only because we assume that there is exacly one solution
        )
    df_mp["speed limit"] = list_sl
    

    Result

    print(df_mp)
    
       state route  milepost   latitude  longitude  speed limit
    0            2       0.0 -122.18988   47.97906           30
    1            2       0.1 -122.18807   47.97976           30
    2            2       0.2 -122.18621   47.97956           55
    3            2       0.3 -122.18420   47.97910           55
    4            2       0.4 -122.18210   47.97872           55
    5            2       0.5 -122.18000   47.97800           55
    6            2       0.6 -122.17500   47.97750           55