The data is based on a dataset from Kaggle, loaded into R.
It has the following structure:
Index  VisitorId            VisitId     Visit#  Hit#  pagePath
0      000722514342430295   1470093727  1       1     /home
1      000722514342430295   1470093727  1       3     /google+redesign/apparel
2      000722514342430295   1470093727  1       4     /asearch.html
3      000722514342430295   1470093727  1       5     /asearch.html
4      0014659935183303341  1470037282  1       1     /home
5      0015694432801235877  1470043732  1       1     /home
6      0015694432801235877  1470043732  1       2     /google+redesign/electronics
7      0015694432801235877  1470043732  1       3     /google+redesign/apparel/men++s/men++s+t+shirts
8      0015694432801235877  1470043732  1       4     /google+redesign/apparel/kid+s/kid+s+infant
9      0015694432801235877  1470043732  1       5     /google+redesign/apparel/kid+s/kid+s+infant/quickview
I'm trying to implement a mutate with lag that returns the previous pagePath for a given visit by a given visitor.
For example, a new column prev_path would be specific to both VisitorId and VisitId and would lag Hit# by 1, but would return <NA> when the previous hit is not available, as in the case of Visit 1, Hit 2 (missing for the first visitor, so Hit 3 has no previous path).
Is this what you're trying to do?
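library(dplyr)

df %>%
  # group so that lag() never crosses visitors or visits
  group_by(VisitorId, VisitId) %>%
  # keep the previous path only if the previous row is the immediately preceding hit;
  # the first hit of a visit has no previous row and gets NA (assumes pagePath is character)
  mutate(prev_path = ifelse(lag(`Hit#`) == `Hit#` - 1, lag(pagePath), NA)) %>%
  ungroup()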
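For a quick check, here is a minimal, self-contained sketch that rebuilds the ten sample rows from the question and applies the same grouped lag. A few things in it are my assumptions rather than anything from your data: the IDs are typed as character to keep the leading zeros, the rows are sorted by Hit# within each visit in case they are not already in hit order, and if_else() with NA_character_ is used in place of ifelse() so the result type stays character. The name result is just illustrative.

```r
library(dplyr)

# Sample rows copied from the question (column types are assumed)
df <- tibble(
  VisitorId = c(rep("000722514342430295", 4),
                "0014659935183303341",
                rep("0015694432801235877", 5)),
  VisitId   = c(rep(1470093727, 4), 1470037282, rep(1470043732, 5)),
  `Visit#`  = 1L,
  `Hit#`    = c(1L, 3L, 4L, 5L, 1L, 1L, 2L, 3L, 4L, 5L),
  pagePath  = c("/home",
                "/google+redesign/apparel",
                "/asearch.html",
                "/asearch.html",
                "/home",
                "/home",
                "/google+redesign/electronics",
                "/google+redesign/apparel/men++s/men++s+t+shirts",
                "/google+redesign/apparel/kid+s/kid+s+infant",
                "/google+redesign/apparel/kid+s/kid+s+infant/quickview")
)

result <- df %>%
  group_by(VisitorId, VisitId) %>%
  # make sure hits are in order inside each visit before lagging
  arrange(`Hit#`, .by_group = TRUE) %>%
  # previous path only when the previous row is exactly one hit earlier
  mutate(prev_path = if_else(lag(`Hit#`) == `Hit#` - 1L,
                             lag(pagePath),
                             NA_character_)) %>%
  ungroup()

print(result)
```

With this sample, prev_path is NA for the first hit of every visit and also for Hit 3 of the first visitor, since Hit 2 is missing there.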