logging apache-spark machine-learning usage-statistics

clustering by access timestamp

Assume we have the web access logs below.

timestamp                  page_visited
======================================
2017-01-02 10:00:02         /xxx/a.html
2017-01-02 10:00:06         /xxx/b.html
2017-01-02 10:01:03         /xxx/c.html
2017-01-02 10:02:02         /xxx/d.html
2017-01-02 15:00:02         /xxx/a.html
2017-01-02 15:01:10         /xxx/b.html
2017-01-02 15:03:05         /xxx/c.html

The user visited our web site 2 times and viewed 7 pages in total. My question is: what is the best way to find out how many times he visited our web site, rather than how many pages he visited?

Because the user might access a different number of pages and spend a different amount of time on each visit, it is hard to set a fixed count or interval to group those records. Is there any algorithm to group (cluster) those records based on their timestamps? Thanks.

Solution

Session start/end

A simple approach is to pick a fixed period of inactivity that indicates a session has ended. I've seen 20 minutes of inactivity used for this.
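As a minimal sketch of the fixed-timeout idea, the snippet below counts sessions in the sample log from the question. The function name `count_sessions` and the 20-minute default are illustrative assumptions, not a standard API.

```python
from datetime import datetime, timedelta

# Timestamps copied from the sample log in the question.
raw = [
    "2017-01-02 10:00:02", "2017-01-02 10:00:06", "2017-01-02 10:01:03",
    "2017-01-02 10:02:02", "2017-01-02 15:00:02", "2017-01-02 15:01:10",
    "2017-01-02 15:03:05",
]
visits = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in raw]


def count_sessions(times, timeout=timedelta(minutes=20)):
    """Count sessions; a gap longer than `timeout` starts a new session."""
    ordered = sorted(times)
    if not ordered:
        return 0
    sessions = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > timeout:
            sessions += 1
    return sessions


print(count_sessions(visits))
```

The only gap longer than 20 minutes is the one between 10:02:02 and 15:00:02, so the sample log splits into two sessions.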

A more robust approach involves treating this as a probabilistic problem given that there is no fixed length of a session, or fixed amount of time between sessions.

The first thing you need to do is look at the data, particularly the inter-arrival times. You have a list of page_visited events, so plot the distribution of inter-arrival times in seconds (the time elapsed between consecutive page visits).
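A rough sketch of how the inter-arrival times could be computed from the sample log is shown here. The order-of-magnitude bucketing is only a crude stand-in for a real histogram and is an assumption of this example; with real data you would feed `gaps` to whatever plotting tool you prefer.

```python
import math
from collections import Counter
from datetime import datetime

# Timestamps copied from the sample log in the question.
raw = [
    "2017-01-02 10:00:02", "2017-01-02 10:00:06", "2017-01-02 10:01:03",
    "2017-01-02 10:02:02", "2017-01-02 15:00:02", "2017-01-02 15:01:10",
    "2017-01-02 15:03:05",
]
times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in raw)

# Inter-arrival time: seconds elapsed between consecutive page visits.
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)

# Crude text histogram: bucket each gap by its order of magnitude (10**k seconds).
buckets = Counter(int(math.log10(max(g, 1))) for g in gaps)
for k in sorted(buckets):
    print(f"10^{k} s: {'#' * buckets[k]}")
```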

A fair assumption is that the distribution will look Poisson-like (right-skewed, with most gaps short), or Poisson-like with additional humps if inter-session times are indeed clustered.

If the data shows a nice Poisson-like distribution, a simple approach would be to use the distribution of inter-arrival times directly.

By taking a percentile that is appropriate to your use-case from the distribution of inter-arrival times, you may determine a pretty useful threshold above which the inter-arrival time suggests a new session has started.
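The percentile idea can be sketched as follows. The `gaps` values are hypothetical inter-arrival times in seconds, the 90th percentile is an arbitrary choice for illustration, and `percentile` is a small nearest-rank helper written for this example rather than a library function.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical inter-arrival times in seconds (not real measurements).
gaps = [4, 57, 59, 68, 115, 30, 12, 240, 95, 17880,
        45, 20, 80, 150, 21600, 9, 33, 70, 61, 18]

# Assumption: gaps above the 90th percentile are treated as session boundaries.
threshold = percentile(gaps, 90)
starts_new_session = [g > threshold for g in gaps]

print(threshold, sum(starts_new_session))
```

Which percentile is appropriate depends on the use case; a higher percentile merges more activity into a single session, a lower one splits it more aggressively.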

Alternatively, if it is more useful, you may use the distribution to obtain the probability of observing the inter-arrival time, with a low probability indicating the start/end of a new session.
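One way to read "the probability of observing the inter-arrival time" is as an empirical tail probability, i.e. the share of observed gaps that are at least as long as the one in question. The sketch below takes that reading; the sample values and the 0.1 cutoff are assumptions made for illustration.

```python
def tail_probability(sample, x):
    """Empirical P(gap >= x): fraction of observed gaps at least as long as x."""
    return sum(1 for g in sample if g >= x) / len(sample)


# Hypothetical inter-arrival times in seconds (not real measurements).
gaps = [4, 57, 59, 68, 115, 30, 12, 240, 95, 17880,
        45, 20, 80, 150, 21600, 9, 33, 70, 61, 18]

# Assumption: a gap this rare or rarer is treated as a session boundary.
cutoff = 0.1

for gap in (60, 17880):
    p = tail_probability(gaps, gap)
    print(gap, p, "boundary" if p <= cutoff else "same session")
```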

The case is more complex if the distribution is bi-modal, say, because people tend to space their sessions similarly. If so, it may be simpler to apply a basic clustering algorithm such as k-means to the inter-arrival times, where you would expect one cluster for in-session gaps and one cluster for inter-session gaps.
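A minimal sketch of two-cluster k-means on one-dimensional inter-arrival times follows. It is a hand-rolled illustration in plain Python, with the helper name `two_means` and the min/max initialisation chosen for this example; on real data a library implementation (for example the KMeans estimator in Spark MLlib) would be the more usual choice.

```python
def two_means(values, iterations=100):
    """1-D k-means with k=2; returns (labels, (low_center, high_center)).

    Label 0 marks the cluster of short gaps, label 1 the cluster of long gaps.
    """
    lo, hi = float(min(values)), float(max(values))  # initial centers
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each gap joins the nearer center.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if not low or not high:
            break  # degenerate case: all values fell into one cluster
        # Update step: move each center to the mean of its members.
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)


# Inter-arrival times (seconds) derived from the sample log in the question.
gaps = [4, 57, 59, 17880, 68, 115]
labels, centers = two_means(gaps)

# Every long gap separates two sessions, so sessions = long gaps + 1.
print(labels, centers, 1 + sum(labels))
```

Because session gaps are often orders of magnitude longer than in-session gaps, clustering on the logarithm of the gap may separate the two groups more cleanly; that variation is not shown here.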

Count sessions

Once you have arrived at an appropriate method to identify distinct sessions, it is straightforward to assign each session a unique key, group by user, and count the unique keys.
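The counting step might look like the sketch below. The user ids, the extra `user_2` events, the `sessionize` helper, and the key format are all hypothetical; the 20-minute timeout stands in for whichever session-boundary rule you settled on.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (user, timestamp) pairs; user ids and the user_2 rows are hypothetical.
events = [
    ("user_1", "2017-01-02 10:00:02"), ("user_1", "2017-01-02 10:00:06"),
    ("user_1", "2017-01-02 10:01:03"), ("user_1", "2017-01-02 10:02:02"),
    ("user_1", "2017-01-02 15:00:02"), ("user_1", "2017-01-02 15:01:10"),
    ("user_1", "2017-01-02 15:03:05"),
    ("user_2", "2017-01-02 11:00:00"), ("user_2", "2017-01-02 11:10:00"),
]


def sessionize(events, timeout):
    """Return (user, session_key, timestamp) rows with one key per session."""
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(datetime.strptime(ts, FMT))
    rows = []
    for user, times in by_user.items():
        times.sort()
        index = 0
        for i, t in enumerate(times):
            if i > 0 and t - times[i - 1] > timeout:
                index += 1  # gap too long: start a new session
            rows.append((user, f"{user}-{index}", t))
    return rows


rows = sessionize(events, timedelta(minutes=20))

# Group by user and count the distinct session keys.
keys_per_user = defaultdict(set)
for user, key, _ in rows:
    keys_per_user[user].add(key)
session_counts = {user: len(keys) for user, keys in keys_per_user.items()}

print(session_counts)
```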