Search code examples
pythonpandasdataframeapply

Comparison between rows based on custom function


I have a large size data frame such as below:

Vehicle  longitude latitude trip 
  0        33        155     0
  0        34        156     1
  1        32        154     2
  1        37        154     5
  2        25        145     2
  .        .          .      .
  .        .          .      .

I also defined a custom boolean function to check if coordination is inside a specific area.

def check(main_vehicle_latitude,main_vehicle_longitude,radius,compare_vehicle_latitdude,compare_vehicle_longitude):
  if condition:
     x=True
  return X

Now I want to apply this function to (each row) of my data frame in a way that I compare each vehicle/trip with all other vehicle coordinates and find all the vehicles that have a similar trip location so the final output would be a list for each vehicle that includes all other vehicles that have a similar trip location. For example, the coordinates of vehicle (0) and trip (0) should be compared with all other vehicles to find a list of all vehicles that have similar start coordinates with the first vehicle (trip 0) and continue to check this for all vehicle trips. It seems a bit complicated to explain but hopefully, it was clear enough. I'm looking for a very efficient way since the data frame is large but unfortunately, I'm a beginner so any help with this would be greatly appreciated.


Solution

  • With the following toy dataframe:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "vehicle": [0, 0, 1, 1, 2, 3, 4, 5],
            "longitude": [33, 34, 32, 37, 25, 33, 37, 33],
            "latitude": [155, 156, 154, 154, 145, 155, 154, 155],
            "trip": [0, 1, 2, 5, 2, 1, 0, 6],
        }
    )
    
    print(df)
    # Output
       vehicle  longitude  latitude  trip
    0        0         33       155     0
    1        0         34       156     1
    2        1         32       154     2
    3        1         37       154     5
    4        2         25       145     2
    5        3         33       155     1
    6        4         37       154     0
    7        5         33       155     6
    

    Here is one way to do it with Pandas groupby, explode, and concat:

    # Find matches
    tmp = df.groupby(["longitude", "latitude"]).agg(list).reset_index(drop=True)
    tmp["match"] = tmp.apply(lambda x: 1 if len(x["vehicle"]) > 1 else pd.NA, axis=1)
    tmp = tmp.dropna()
    
    # Format results
    tmp["match"] = tmp.apply(
        lambda x: [[v, t] for v, t in zip(x["vehicle"], x["trip"])], axis=1
    )
    tmp = tmp.explode("vehicle")
    tmp = tmp.explode("trip")
    tmp["match"] = tmp.apply(
        lambda x: x["match"] if [x["vehicle"], x["trip"]] in x["match"] else pd.NA, axis=1
    )
    tmp = tmp.dropna()
    tmp["match"] = tmp.apply(
        lambda x: [p[0] for p in x["match"] if p[0] != x["vehicle"]], axis=1
    )
    
    # Add results back to initial dataframe
    df = pd.concat(
        [df.set_index(["vehicle", "trip"]), tmp.set_index(["vehicle", "trip"])], axis=1
    )
    

    Then:

    print(df)
    # Output
                  longitude  latitude   match
    vehicle trip
    0       0            33       155  [3, 5]
            1            34       156     NaN
    1       2            32       154     NaN
            5            37       154     [4]
    2       2            25       145     NaN
    3       1            33       155  [0, 5]
    4       0            37       154     [1]
    5       6            33       155  [0, 3]