Search code examples
sqlpostgresqldatewindow-functionsgaps-and-islands

Modifying the id value of a row based on interval time threshold in date time column


I am working on geolife dataset, which contains timestamped GPS track of users in a text file (.plt). Each text file contains the user's GPS points for one trip. I therefore imported the dataset to postgres using python script.

Because the files are named with string of numbers according to the trip's start time (so for example, the file containing the trip in table below is 20070920074804.plt), I give the trip id (session_id) the file name (without the extension). That's the raw GPS in this table trajectories.

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |    20070920074804 | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |    20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |    20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |    20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |    20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |    20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

For analysis purposes, I created another table trips_metrics where I compute trip metrics from trajectories table and insert the result to trip_metrics. Among the values I compute are trip distance (haversine) and duration (start time - end time).

Then I noticed something strange, a user took 8hrs of trip but covers a distance of 321m. Going through the trip file thoroughly I noticed there's jump in trip's time, suggesting a break in the trip (possibly user stays for hours then continue). An example is in row 3 and row 4 in table above.

To get accurate trip time, I need to split trips with these cases, in a way that if the time interval between consecutive rows exceeds 30mins, it should be considered a new trip (thus new ID).

I intend to add digit ..02, ..03, .. to the trip's current session_id in my trajectories table before actually computing trips metrics (i.e. modifying the trajectories table). So for the example in table above, I want to split it this way:

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |  20070920074804   | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |  20070920074804   | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |  20070920074804   | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |  2007092007480402 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |  2007092007480402 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |  2007092007480402 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

Notice how I assign the session_id for the new trip (since the time in between is more than 30mins).

How can I do this modification or alteration to my raw GPS table (trajectories) in postgres?

EDIT

A: The first query in the answer from @GMB works, however, it gives each row I new session_id in the new_session_id column.

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480402 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480403 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480404 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480405 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

Expected Result:

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480401 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

The idea is to give the "emerging" trip a new id by old_session_id + 01. If another emerging trip is encountered it should be assigned old_session_id + 02 and so on.

B: The second query with update option contains a syntax error:

update trajectories t
from (
    select 
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp))::int is_gap
        from trajectories t
    ) t
) t1
set session_id = t1.new_session_id
where t1.session_id = t.session_id and t1.timestamp = t.timestamp

ERROR:  syntax error at or near "from"
LINE 2: from (

Solution

  • You can use lag(), a cumulative sum to identify the segements, and then some way of munging the session_id:

    select (case when grp >= 1 then session_id * 100 + grp
                 else session_id
            end) as new_session_id,
           t.*
    from (select t.*,
                 count(*) filter (where prev_ts < timestamp - interval '30 minute') over (partition by session_id, order by timestamp) as grp
          from (select t.*, 
                       lag(timestamp) over (partition by session_id order by timestamp) as prev_ts
                from trajectories t
               ) t
         ) t;
    

    Here is a db<>fiddle.