I have two dataframes, mp and sp.
mp contains state route (int), ~1/10 milepost (float) marks, latitude (float), longitude (float).
route milepost lon lat
2091 2 0.0 -122.189880 47.979063
2667 2 0.1 -122.188067 47.979760
2159 2 0.2 -122.186210 47.979560
1174 2 0.3 -122.184197 47.979098
1878 2 0.4 -122.182098 47.978718
sp has state route (int), ~ milepost (float) where the speed limit begins and ends, and speed limit (int).
route begin end limit
3119 2 0.000000 0.130000 30
2787 2 0.130000 2.880000 55
3184 2 2.880000 12.080000 60
3035 2 12.080000 12.720000 55
2780 2 12.720000 14.220000 45
2777 2 14.220000 15.450000 35
2774 2 15.450000 21.500000 55
I want to have one dataframe (sp + speed limit), but cannot join on the state route and milepost. How can I apply the speed limit to each milepost in mp based on the interval? That is, for state route 2, milepost 0, 0.1, speed limit = 30; milepost 0.2, 0.3, ..., 2.8, speed limit = 55; etc.
route milepost lon lat limit
2091 2 0.0 -122.189880 47.979063 30
2667 2 0.1 -122.188067 47.979760 55
2159 2 0.2 -122.186210 47.979560 55
...
2789 2 2.8 -122.134982 47.972241 60
1388 2 2.9 -122.133974 47.970995 60
1609 2 3.0 -122.132964 47.969750 60
I have tried pandas merge, but lost many rows. Tried to fuzzy join using fuzzypanda, but get an error that fuzzypanda.matching doesn't exist (the one method it should have!), so cannot create fuzzy column to merge on. The following didn't work out either:
for route in mp['route']:
for milepost in mp['milepost']:
if (route == sp['route']) & (milepost >= sp['begin']) & (milepost <= sp['end']):
mp.insert(1, 'limit', sp['limit'][0])
Any approach to the problem is welcome; no need to go the way I was trying! Thank you!
I have a first solution to start over, but with a few weaknesses:
mp
DataFrame row by row with a for
loop, based on the "reference" table provided by the sp
DataFrame.sp
for each ("state route", "milepost")
couple of the mp
DataFrame.I added two lines to your mp
DataFrame such that mp
and sp
do not share the same length.
import pandas as pd
mp_data = {
"state route": [2, 2, 2, 2, 2, 2, 2],
"milepost": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
"latitude": [
-122.18988,
-122.18807,
-122.18621,
-122.1842,
-122.1821,
-122.18,
-122.175,
],
"longitude": [47.97906, 47.97976, 47.97956, 47.9791, 47.97872, 47.978, 47.9775],
}
df_mp = pd.DataFrame(mp_data)
sp_data = {
"state route": [2, 2, 2, 2, 2],
"start speed limit milepost": [0, 0.13, 2.88, 12.08, 12.72],
"stop speed limit milepost": [0.13, 2.88, 12.08, 12.72, 14.22],
"speed limit": [30, 55, 60, 55, 45],
}
df_sp = pd.DataFrame(sp_data)
print(df_mp)
state route milepost latitude longitude
0 2 0.0 -122.18988 47.97906
1 2 0.1 -122.18807 47.97976
2 2 0.2 -122.18621 47.97956
3 2 0.3 -122.18420 47.97910
4 2 0.4 -122.18210 47.97872
5 2 0.5 -122.18000 47.97800
6 2 0.6 -122.17500 47.97750
print(df_sp)
state route start speed limit milepost stop speed limit milepost \
0 2 0.00 0.13
1 2 0.13 2.88
2 2 2.88 12.08
3 2 12.08 12.72
4 2 12.72 14.22
speed limit
0 30
1 55
2 60
3 55
4 45
df_mp
DataFramelist_sl = []
# comment: preparing a list that will be filled row by row
# is also an unsatisfactory method that could be improved
for _, row in df_mp.iterrows():
list_sl.append(
(
df_sp["speed limit"].loc[
(row["state route"] == df_sp["state route"])
& (row["milepost"] >= df_sp["start speed limit milepost"])
& (row["milepost"] < df_sp["stop speed limit milepost"])
]
).iloc[0] # works only because we assume that there is exacly one solution
)
df_mp["speed limit"] = list_sl
print(df_mp)
state route milepost latitude longitude speed limit
0 2 0.0 -122.18988 47.97906 30
1 2 0.1 -122.18807 47.97976 30
2 2 0.2 -122.18621 47.97956 55
3 2 0.3 -122.18420 47.97910 55
4 2 0.4 -122.18210 47.97872 55
5 2 0.5 -122.18000 47.97800 55
6 2 0.6 -122.17500 47.97750 55