python python-3.x pandas dataframe sklearn-pandas

Users' trip time over a particular period of time

The Geolife dataset is a collection of GPS trajectories of users, logged as they move. Thanks to Sina Dabiri for providing a repository of the preprocessed data. I work with his preprocessed data and have created a dataframe of GPS logs for the 69 users available.

This post includes a very small extract of the data for 3 users to describe my question.

import pandas as pd

data = {'user': [10,10,10,10,10,10,10,10,21,21,21,54,54,54,54,54,54,54,54,54],
 'lat': [39.921683,39.921583,39.92156,39.13622,39.136233,39.136241,39.136246,39.136251,42.171678,42.172055,
         42.172243,39.16008333,39.15823333,39.1569,39.156,39.15403333,39.15346667,39.15273333,39.14811667,39.14753333],
 'lon': [116.472342,116.472315,116.47229,117.218033,117.218046,117.218066,117.218166,117.218186,123.676778,123.677365,
         123.677657,117.1994167,117.2002333,117.2007667,117.2012167,117.202,117.20225,117.20255,117.2043167,117.2045833],
 'date': ['2009-03-21 13:30:35','2009-03-21 13:33:38','2009-03-21 13:34:40','2009-03-21 15:30:12','2009-03-21 15:32:35',
          '2009-03-21 15:38:36','2009-03-21 15:44:42','2009-03-21 15:48:43','2007-04-30 16:00:20', '2007-04-30 16:05:22',
          '2007-04-30 16:08:23','2007-04-30 11:47:38','2007-04-30 11:48:07','2007-04-30 11:48:27','2007-04-30 12:04:39',
          '2007-04-30 12:04:07','2007-04-30 12:04:32','2007-04-30 12:19:41','2007-04-30 12:20:08','2007-04-30 12:20:21']
 }

And the dataframe:

df = pd.DataFrame(data)

df
    user    lat        lon            date
0   10  39.921683   116.472342  2009-03-21 13:30:35
1   10  39.921583   116.472315  2009-03-21 13:33:38
2   10  39.921560   116.472290  2009-03-21 13:34:40
3   10  39.136220   117.218033  2009-03-21 15:30:12
4   10  39.136233   117.218046  2009-03-21 15:32:35
5   10  39.136241   117.218066  2009-03-21 15:38:36
6   10  39.136246   117.218166  2009-03-21 15:44:42
7   10  39.136251   117.218186  2009-03-21 15:48:43
8   21  42.171678   123.676778  2007-04-30 16:00:20
9   21  42.172055   123.677365  2007-04-30 16:05:22
10  21  42.172243   123.677657  2007-04-30 16:08:23
11  54  39.160083   117.199417  2007-04-30 11:47:38
12  54  39.158233   117.200233  2007-04-30 11:48:07
13  54  39.156900   117.200767  2007-04-30 11:48:27
14  54  39.156000   117.201217  2007-04-30 12:04:39
15  54  39.154033   117.202000  2007-04-30 12:04:07
16  54  39.153467   117.202250  2007-04-30 12:04:32
17  54  39.152733   117.202550  2007-04-30 12:19:41
18  54  39.148117   117.204317  2007-04-30 12:20:08
19  54  39.147533   117.204583  2007-04-30 12:20:21

My Question:

I want to calculate how long users travel in a particular period.

For example:

Total time users travelled in March 2009: only user 10 travelled in this month, on 2009-03-21, starting at 13:30:35. After 13:34:40 there is a long jump to 15:30:12. Since this gap is more than 30 minutes, we consider it another trip. So user 10 has 2 trips recorded that month: the first lasts about 4 minutes, the second about 19 minutes. The total trip duration for this month is therefore about 4 + 19 = 23 minutes.
In April 2007, users 21 and 54 recorded trips on the same day. User 21 started at 16:00:20 and travelled for about 8 minutes. User 54 started at 11:47:38 and, after about 1 minute, there is a jump to 12:04:39. That gap is less than 30 minutes, so we consider it a single trip. User 54's trip therefore lasted about 33 minutes. The total trip time in that month is therefore 8 + 33 = 41 minutes.
Sometimes I would also want to determine the trip time over a longer range, say from February 2008 to March 2009.

How do I perform this sort of analysis?

Any pointers, using the small sample provided above, would be appreciated.

Solution

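This code isn't the most efficient, but you can test whether it does what you need. Note that it groups by calendar month number only (not by year) and assumes each user's rows are already in chronological order: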

df['date'] = pd.to_datetime(df['date'])

duration = (df.groupby(['user', df['date'].dt.month]).
            apply(lambda x: (x['date']-x['date'].shift()).dt.seconds).
            rename('duration').
            to_frame())

res = (duration.mask(duration>1800,0).  # 1800 s = 30 min; a longer gap between consecutive points starts a new trip
       groupby(level=[0,1]).
       sum().
       truediv(60).  # converting seconds to minutes
       rename_axis(index={'date':'month'}))

print(res)
'''
            duration
user month          
10   3         22.60
21   4          8.05
54   4         33.25
'''
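For an arbitrary date range such as February 2008 to March 2009, a variation along the following lines might help. It is only a sketch: the function name `trip_minutes` and its parameters `start`, `end` and `max_gap_s` are made up for illustration, and it assumes that a gap longer than 30 minutes between consecutive points of the same user separates two trips. Unlike the code above, it sorts each user's points first (rows 14 to 16 of the sample are out of order) and uses `total_seconds()` so that long or negative gaps are not wrapped.

```python
import pandas as pd

def trip_minutes(df, start=None, end=None, max_gap_s=1800):
    """Total trip time per user in minutes, optionally limited to [start, end).

    Hypothetical helper: gaps longer than max_gap_s seconds are assumed to
    separate trips and are not counted as travel time.
    """
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    # Optional period filter: start is inclusive, end is exclusive.
    if start is not None:
        out = out[out["date"] >= pd.Timestamp(start)]
    if end is not None:
        out = out[out["date"] < pd.Timestamp(end)]
    # Sort so every gap is measured between chronologically adjacent points.
    out = out.sort_values(["user", "date"])
    # Seconds since the same user's previous point (NaN for a user's first point).
    gap = out.groupby("user")["date"].diff().dt.total_seconds()
    # Keep gaps within the threshold; trip breaks and first points count as 0.
    travel = gap.where(gap <= max_gap_s, 0.0)
    return travel.groupby(out["user"]).sum().div(60).rename("minutes")

# The user and date columns of the sample from the question.
sample = pd.DataFrame({
    "user": [10] * 8 + [21] * 3 + [54] * 9,
    "date": ["2009-03-21 13:30:35", "2009-03-21 13:33:38", "2009-03-21 13:34:40",
             "2009-03-21 15:30:12", "2009-03-21 15:32:35", "2009-03-21 15:38:36",
             "2009-03-21 15:44:42", "2009-03-21 15:48:43", "2007-04-30 16:00:20",
             "2007-04-30 16:05:22", "2007-04-30 16:08:23", "2007-04-30 11:47:38",
             "2007-04-30 11:48:07", "2007-04-30 11:48:27", "2007-04-30 12:04:39",
             "2007-04-30 12:04:07", "2007-04-30 12:04:32", "2007-04-30 12:19:41",
             "2007-04-30 12:20:08", "2007-04-30 12:20:21"],
})

per_user = trip_minutes(sample)                          # whole sample
march_2009 = trip_minutes(sample, "2009-03-01", "2009-04-01")
feb08_to_mar09 = trip_minutes(sample, "2008-02-01", "2009-04-01")
print(per_user)
print(march_2009.sum())
```

On the sample this gives 22.6 minutes for user 10 and 8.05 for user 21, matching the result above, and about 32.72 for user 54 instead of 33.25, because the points are sorted before the gaps are taken. Filtering before computing the gaps means a trip that crosses the period boundary is only counted for the part inside the period; whether that is the desired behaviour depends on your definition of a trip.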