Search code examples
sqlsnowflake-cloud-data-platformdistance

Finding the total distance of a grouped trips in SQL


I have a data that shows individual vs grouped trips of a taxi service

|----------------------------------------------------------------------------------------------------------------------------------------------|
|  Trip_id  |   Trip_Created_Time    |   start_lat     |   start_lon    |     end_lat    |    end_lon   | trip_updated_time   |    Is_Group    |
|----------------------------------------------------------------------------------------------------------------------------------------------|
|    1      |   2021-07-01 17:29:51  |     81.91892    |  -42.19823     |   81.90281     |   -42.38918  | 2021-07-01 17:35:21 |      YES       |
|    2      |   2021-07-01 17:31:52  |     81.91892    |  -42.46920     |   81.97392     |   -42.37819  | 2021-07-01 17:52:51 |      YES       |
|    3      |   2021-07-02 21:50:51  |     81.91892    |  -42.01936     |   81.18937     |   -42.01967  | 2021-07-02 22:09:09 |      NO        |
|    4      |   2021-07-02 23:31:41  |     81.91892    |  -42.47821     |   81.01792     |   -42.17839  | 2021-07-02 23:41:51 |      NO        |
|    5      |   2021-09-21 20:12:54  |     81.91892    |  -42.47821     |   81.63829     |   -42.67292  | 2021-09-21 20:42:54 |      YES       |
|    6      |   2021-09-21 20:15:21  |     81.91892    |  -42.47821     |   81.62819     |   -42.01927  | 2021-09-21 20:59:21 |      YES       |
|    7      |   2021-09-21 20:17:23  |     81.91892    |  -42.47821     |   81.03926     |   -42.36284  | 2021-09-21 21:02:21 |      YES       |
|    8      |   2021-11-01 02:41:41  |     81.91892    |  -42.47821     |   81.36292     |   -42.47682  | 2021-07-02 23:41:51 |      NO        |
|    9      |   2021-12-21 19:19:41  |     81.91892    |  -42.47821     |   81.23671     |   -42.93628  | 2021-07-02 23:41:51 |      NO        |
|----------------------------------------------------------------------------------------------------------------------------------------------|

Group = Two or more users start at same location but with different destinations

I'm trying to find the distance from starting lat/lon to the ending lat/lon for grouped and non-grouped trips

Here's my try:

select *,
       case when is_group = 'NO'
            then haversine(start_lat, start_lon, end_lat, end_lon)
        when is_group = 'YES'
            then NULL
       end as trip_distance
from my_table

The current output is as follows:

|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  Trip_id  |   Trip_Created_Time    |   start_lat     |   start_lon    |     end_lat    |    end_lon   | trip_updated_time   |    Is_Group    |  trip_distance |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    1      |   2021-07-01 17:29:51  |     81.91892    |  -42.19823     |   81.90281     |   -42.38918  | 2021-07-01 17:35:21 |      YES       |     NULL       |
|    2      |   2021-07-01 17:31:52  |     81.91892    |  -42.46920     |   81.97392     |   -42.37819  | 2021-07-01 17:52:51 |      YES       |     NULL       |
|    3      |   2021-07-02 21:50:51  |     81.91892    |  -42.01936     |   81.18937     |   -42.01967  | 2021-07-02 22:09:09 |      NO        |    1.289247    |
|    4      |   2021-07-02 23:31:41  |     81.91892    |  -42.47821     |   81.01792     |   -42.17839  | 2021-07-02 23:41:51 |      NO        |    0.387922    |
|    5      |   2021-09-21 20:12:54  |     81.91892    |  -42.47821     |   81.63829     |   -42.67292  | 2021-09-21 20:42:54 |      YES       |     NULL       |
|    6      |   2021-09-21 20:15:21  |     81.91892    |  -42.47821     |   81.62819     |   -42.01927  | 2021-09-21 20:59:21 |      YES       |     NULL       |
|    7      |   2021-09-21 20:17:23  |     81.91892    |  -42.47821     |   81.03926     |   -42.36284  | 2021-09-21 21:02:21 |      YES       |     NULL       |
|    8      |   2021-11-01 02:41:41  |     81.91892    |  -42.47821     |   81.36292     |   -42.47682  | 2021-07-02 23:41:51 |      NO        |    3.29181     |
|    9      |   2021-12-21 19:19:41  |     81.91892    |  -42.47821     |   81.23671     |   -42.93628  | 2021-07-02 23:41:51 |      NO        |    0.29822     |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|

How can I calculate the trip_distance if it is in a group i.e. is_group = 'YES'

Edit: The final output distance for a grouped should be the sum of distances of all trips in this group i.e. A->B->C = A+B+C

|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  Trip_id  |   Trip_Created_Time    |   start_lat     |   start_lon    |     end_lat    |    end_lon   | trip_updated_time   |    Is_Group    |  trip_distance |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    1      |   2021-07-01 17:29:51  |     81.91892    |  -42.19823     |   81.90281     |   -42.38918  | 2021-07-01 17:35:21 |      YES       |    1.28463     | <--- 
|    2      |   2021-07-01 17:31:52  |     81.91892    |  -42.46920     |   81.97392     |   -42.37819  | 2021-07-01 17:52:51 |      YES       |    1.28463     | <--- These two total distances are same since grouped
|    3      |   2021-07-02 21:50:51  |     81.91892    |  -42.01936     |   81.18937     |   -42.01967  | 2021-07-02 22:09:09 |      NO        |    1.289247    |
|    4      |   2021-07-02 23:31:41  |     81.91892    |  -42.47821     |   81.01792     |   -42.17839  | 2021-07-02 23:41:51 |      NO        |    0.387922    |
|    5      |   2021-09-21 20:12:54  |     81.91892    |  -42.47821     |   81.63829     |   -42.67292  | 2021-09-21 20:42:54 |      YES       |    4.38921     | <---
|    6      |   2021-09-21 20:15:21  |     81.91892    |  -42.47821     |   81.62819     |   -42.01927  | 2021-09-21 20:59:21 |      YES       |    4.38921     | <---
|    7      |   2021-09-21 20:17:23  |     81.91892    |  -42.47821     |   81.03926     |   -42.36284  | 2021-09-21 21:02:21 |      YES       |    4.38921     | <--- These three total distances are same since grouped
|    8      |   2021-11-01 02:41:41  |     81.91892    |  -42.47821     |   81.36292     |   -42.47682  | 2021-07-02 23:41:51 |      NO        |    3.29181     |
|    9      |   2021-12-21 19:19:41  |     81.91892    |  -42.47821     |   81.23671     |   -42.93628  | 2021-07-02 23:41:51 |      NO        |    0.29822     |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|

Solution

  • So you example data/SQL does not produce you example output:

    WITH fake_data AS (
        SELECT * FROM VALUES
            ( 1, '2021-07-01 17:29:51', 81.91892, -42.19823, 81.90281, -42.38918, '2021-07-01 17:35:21', 'YES', 1),
            ( 2, '2021-07-01 17:31:52', 81.91892, -42.46920, 81.97392, -42.37819, '2021-07-01 17:52:51', 'YES', 1),
            ( 3, '2021-07-02 21:50:51', 81.91892, -42.01936, 81.18937, -42.01967, '2021-07-02 22:09:09', 'NO', null),
            ( 4, '2021-07-02 23:31:41', 81.91892, -42.47821, 81.01792, -42.17839, '2021-07-02 23:41:51', 'NO', null),
            ( 5, '2021-09-21 20:12:54', 81.91892, -42.47821, 81.63829, -42.67292, '2021-09-21 20:42:54', 'YES', 2),
            ( 6, '2021-09-21 20:15:21', 81.91892, -42.47821, 81.62819, -42.01927, '2021-09-21 20:59:21', 'YES', 2),
            ( 7, '2021-09-21 20:17:23', 81.91892, -42.47821, 81.03926, -42.36284, '2021-09-21 21:02:21', 'YES', 2),
            ( 8, '2021-11-01 02:41:41', 81.91892, -42.47821, 81.36292, -42.47682, '2021-07-02 23:41:51', 'NO', null),
            ( 9, '2021-12-21 19:19:41', 81.91892, -42.47821, 81.23671, -42.93628, '2021-07-02 23:41:51', 'NO', null)
            t(trip_id, trip_created_time, start_lat, start_lon, end_lat, end_lon, trip_updated_time, is_group, group_id)
    )
    select *,
           case when is_group = 'NO'
                then haversine(start_lat, start_lon, end_lat, end_lon)
            when is_group = 'YES'
                then NULL
           end as trip_distance
    from fake_data;
    

    creates:

    TRIP_ID TRIP_CREATED_TIME START_LAT START_LON END_LAT END_LON TRIP_UPDATED_TIME IS_GROUP GROUP_ID TRIP_DISTANCE
    1 2021-07-01 17:29:51 81.91892 -42.19823 81.90281 -42.38918 2021-07-01 17:35:21 YES 1
    2 2021-07-01 17:31:52 81.91892 -42.4692 81.97392 -42.37819 2021-07-01 17:52:51 YES 1
    3 2021-07-02 21:50:51 81.91892 -42.01936 81.18937 -42.01967 2021-07-02 22:09:09 NO 81.122258891
    4 2021-07-02 23:31:41 81.91892 -42.47821 81.01792 -42.17839 2021-07-02 23:41:51 NO 100.308299209
    5 2021-09-21 20:12:54 81.91892 -42.47821 81.63829 -42.67292 2021-09-21 20:42:54 YES 2
    6 2021-09-21 20:15:21 81.91892 -42.47821 81.62819 -42.01927 2021-09-21 20:59:21 YES 2
    7 2021-09-21 20:17:23 81.91892 -42.47821 81.03926 -42.36284 2021-09-21 21:02:21 YES 2
    8 2021-11-01 02:41:41 81.91892 -42.47821 81.36292 -42.47682 2021-07-02 23:41:51 NO 61.824383293
    9 2021-12-21 19:19:41 81.91892 -42.47821 81.23671 -42.93628 2021-07-02 23:41:51 NO 76.223649989

    But if we assume these are valid:

    what you describe as wanting ( deconstructed ) seems like (if a grouping id is provided):

    SELECT 
        trip_id
        ,trip_created_time
        ,start_lat
        ,start_lon
        ,end_lat
        ,end_lon 
        ,trip_updated_time
        ,is_group
        ,round(haversine(start_lat, start_lon, end_lat, end_lon),3) as dist_km
        ,sum(dist_km) over (partition by group_id) as group_sum_km
        ,iff(is_group='YES', group_sum_km, dist_km) as result
    FROM fake_data
    ORDER BY 1
    ;
    

    gives:

    TRIP_ID TRIP_CREATED_TIME START_LAT START_LON END_LAT END_LON TRIP_UPDATED_TIME IS_GROUP DIST_KM GROUP_SUM_KM RESULT
    1 2021-07-01 17:29:51 81.91892 -42.19823 81.90281 -42.38918 2021-07-01 17:35:21 YES 3.484 9.762 9.762
    2 2021-07-01 17:31:52 81.91892 -42.4692 81.97392 -42.37819 2021-07-01 17:52:51 YES 6.278 9.762 9.762
    3 2021-07-02 21:50:51 81.91892 -42.01936 81.18937 -42.01967 2021-07-02 22:09:09 NO 81.122 319.478 81.122
    4 2021-07-02 23:31:41 81.91892 -42.47821 81.01792 -42.17839 2021-07-02 23:41:51 NO 100.308 319.478 100.308
    5 2021-09-21 20:12:54 81.91892 -42.47821 81.63829 -42.67292 2021-09-21 20:42:54 YES 31.358 162.332 162.332
    6 2021-09-21 20:15:21 81.91892 -42.47821 81.62819 -42.01927 2021-09-21 20:59:21 YES 33.142 162.332 162.332
    7 2021-09-21 20:17:23 81.91892 -42.47821 81.03926 -42.36284 2021-09-21 21:02:21 YES 97.832 162.332 162.332
    8 2021-11-01 02:41:41 81.91892 -42.47821 81.36292 -42.47682 2021-07-02 23:41:51 NO 61.824 319.478 61.824
    9 2021-12-21 19:19:41 81.91892 -42.47821 81.23671 -42.93628 2021-07-02 23:41:51 NO 76.224 319.478 76.224

    which can be compacted to:

    SELECT 
        trip_id
        ,trip_created_time
        ,start_lat
        ,start_lon
        ,end_lat
        ,end_lon 
        ,trip_updated_time
        ,is_group
        , iff(is_group='YES'
            ,sum(haversine(start_lat, start_lon, end_lat, end_lon)) over (partition by group_id)
            ,haversine(start_lat, start_lon, end_lat, end_lon)
          ) as result
    FROM fake_data
    ORDER BY 1
    ;