The Geolife dataset contains GPS trajectories of users, logged as they move. Thanks to Sina Dabiri for providing a repository of the preprocessed data. I worked with his preprocessed data and created a dataframe of GPS logs for the 69 users available.
Below is a very small extract of the data, covering 3 users, to illustrate my question.
import pandas as pd
data = {'user': [10,10,10,10,10,10,10,10,21,21,21,54,54,54,54,54,54,54,54,54],
'lat': [39.921683,39.921583,39.92156,39.13622,39.136233,39.136241,39.136246,39.136251,42.171678,42.172055,
42.172243,39.16008333,39.15823333,39.1569,39.156,39.15403333,39.15346667,39.15273333,39.14811667,39.14753333],
'lon': [116.472342,116.472315,116.47229,117.218033,117.218046,117.218066,117.218166,117.218186,123.676778,123.677365,
123.677657,117.1994167,117.2002333,117.2007667,117.2012167,117.202,117.20225,117.20255,117.2043167,117.2045833],
'date': ['2009-03-21 13:30:35','2009-03-21 13:33:38','2009-03-21 13:34:40','2009-03-21 15:30:12','2009-03-21 15:32:35',
'2009-03-21 15:38:36','2009-03-21 15:44:42','2009-03-21 15:48:43','2007-04-30 16:00:20', '2007-04-30 16:05:22',
'2007-04-30 16:08:23','2007-04-30 11:47:38','2007-04-30 11:48:07','2007-04-30 11:48:27','2007-04-30 12:04:39',
'2007-04-30 12:04:07','2007-04-30 12:04:32','2007-04-30 12:19:41','2007-04-30 12:20:08','2007-04-30 12:20:21']
}
And the dataframe:
df = pd.DataFrame(data)
df
user lat lon date
0 10 39.921683 116.472342 2009-03-21 13:30:35
1 10 39.921583 116.472315 2009-03-21 13:33:38
2 10 39.921560 116.472290 2009-03-21 13:34:40
3 10 39.136220 117.218033 2009-03-21 15:30:12
4 10 39.136233 117.218046 2009-03-21 15:32:35
5 10 39.136241 117.218066 2009-03-21 15:38:36
6 10 39.136246 117.218166 2009-03-21 15:44:42
7 10 39.136251 117.218186 2009-03-21 15:48:43
8 21 42.171678 123.676778 2007-04-30 16:00:20
9 21 42.172055 123.677365 2007-04-30 16:05:22
10 21 42.172243 123.677657 2007-04-30 16:08:23
11 54 39.160083 117.199417 2007-04-30 11:47:38
12 54 39.158233 117.200233 2007-04-30 11:48:07
13 54 39.156900 117.200767 2007-04-30 11:48:27
14 54 39.156000 117.201217 2007-04-30 12:04:39
15 54 39.154033 117.202000 2007-04-30 12:04:07
16 54 39.153467 117.202250 2007-04-30 12:04:32
17 54 39.152733 117.202550 2007-04-30 12:19:41
18 54 39.148117 117.204317 2007-04-30 12:20:08
19 54 39.147533 117.204583 2007-04-30 12:20:21
My Question:
I want to calculate how long users travel in a particular period.
For example:
March 2009: Only user 10 travelled in this month, on 2009-03-21, starting at 13:30:35. After 13:34:40 there is a long jump to 15:30:12. Since this gap is more than 30 minutes, we consider it another trip. So user 10 has 2 trips recorded that month, the first lasting about 5 minutes and the second about 19 minutes. The total trip duration for this month is therefore 5 + 19 = 24 minutes.
April 2007: Users 21 and 54 recorded trips on the same day. User 21 started at 16:00:20 and travelled for about 8 minutes. User 54 started at 11:47:38 and, after about 1 minute, we see a jump to 12:04:39. That gap is less than 30 minutes, so we consider it a single trip, which lasted about 33 minutes. The users' total trip time in that month is therefore 8 + 33 = 41 minutes.
February 2008 to March 2009.
How do I perform this sort of analysis?
Any pointers, using the small sample provided above, would be appreciated.
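To make the 30-minute rule concrete, here is a minimal sketch of one way to read it: sort each user's points by time, start a new trip whenever the gap to the previous point exceeds the threshold, and then sum trip durations for the trips that start inside a chosen period. The names `sample`, `MAX_GAP`, `trip_table` and `minutes_in_period` are made up for this illustration, the sample holds only the timestamps of users 10 and 21 from the extract above, and assigning a trip to the period in which it starts is an assumption.

```python
import pandas as pd

# Timestamps of users 10 and 21 from the extract above (lat/lon are not needed here).
sample = pd.DataFrame({
    'user': [10] * 8 + [21] * 3,
    'date': pd.to_datetime([
        '2009-03-21 13:30:35', '2009-03-21 13:33:38', '2009-03-21 13:34:40',
        '2009-03-21 15:30:12', '2009-03-21 15:32:35', '2009-03-21 15:38:36',
        '2009-03-21 15:44:42', '2009-03-21 15:48:43',
        '2007-04-30 16:00:20', '2007-04-30 16:05:22', '2007-04-30 16:08:23',
    ]),
})

MAX_GAP = pd.Timedelta(minutes=30)  # threshold taken from the question


def trip_table(points, max_gap=MAX_GAP):
    """Return one row per trip: user, trip number, start, end, minutes."""
    points = points.sort_values(['user', 'date'])
    gap = points.groupby('user')['date'].diff()
    # A new trip starts at a user's first point (gap is NaT) or after a long gap.
    new_trip = gap.isna() | (gap > max_gap)
    trip_id = new_trip.astype(int).groupby(points['user']).cumsum().rename('trip')
    trips = (points.groupby(['user', trip_id])['date']
                   .agg(start='min', end='max')
                   .reset_index())
    trips['minutes'] = (trips['end'] - trips['start']).dt.total_seconds() / 60
    return trips


def minutes_in_period(trips, start, end):
    """Total minutes per user for trips that start in the half-open range [start, end)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    in_period = (trips['start'] >= start) & (trips['start'] < end)
    return trips[in_period].groupby('user')['minutes'].sum()


trips = trip_table(sample)
print(trips)
print(minutes_in_period(trips, '2009-03-01', '2009-04-01'))  # March 2009
print(minutes_in_period(trips, '2008-02-01', '2009-04-01'))  # February 2008 to March 2009
```

On this sample, user 10 is split into two trips of about 4.1 and 18.5 minutes (22.6 in total), close to the rough figures quoted above, and user 21 has a single trip of 8.05 minutes. A range such as February 2008 to March 2009 only changes the two bounds passed to `minutes_in_period`.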
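The code below isn't the most efficient, but you can test whether it does what you need:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['user', 'date'])  # gaps only make sense in chronological order
# year-month period, so that e.g. April 2007 and April 2009 are not merged
month = df['date'].dt.to_period('M').rename('month')
# seconds between consecutive points of the same user in the same month;
# total_seconds() is used because .dt.seconds drops the days part and wraps negative gaps
gap = df.groupby(['user', month])['date'].diff().dt.total_seconds()
res = (gap.mask(gap > 1800, 0)  # 1800 s = 30 min: a longer gap separates two trips and adds no travel time
          .groupby([df['user'], month])
          .sum()
          .div(60)  # converting seconds to minutes
          .round(2)
          .to_frame('duration'))
print(res)
'''
              duration
user month
10   2009-03     22.60
21   2007-04      8.05
54   2007-04     32.72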