
Replacing the source in click-stream data


I have clickstream data for an ecommerce website. Some customers can opt to buy the product using a loan / finance option. Unfortunately, this creates a new referral source, labeled 'finance' in the reprex below, and it also starts one or more new sessions.

I would like to replace the source 'finance' with the source of the same user's preceding session.

In the example, all observations for sessions 4-6871.2 and 4-6871.3 would get the source '(direct)', as in session 4-6871.1, and 3-6871.1 would get 'google', as in session 3-6871.0.

I need to do this on a much larger data set, so I need logic that finds the sessions with the 'finance' source and replaces each instance of 'finance' with the source of that user's immediately preceding session.

reprex data via dput:

structure(list(userId = c("6.154032", "6.154032", "6.154032", 
"6.154032", "6.154032", "6.154032", "6.154032", "6.154032", "6.154032", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036"), session_Id = c("4-6871.0", "4-6871.0", 
"4-6871.0", "4-6871.1", "4-6871.1", "4-6871.1", "4-6871.2", "4-6871.2", 
"4-6871.3", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", 
"3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", 
"3-6871.1", "3-6871.1", "3-6871.1"), timeStamp = structure(c(1540294773, 
1540294828, 1540294841, 1540307321, 1540307341, 1540307718, 1540308709, 
1540308749, 1540311289, 1540330293, 1540330309, 1540330475, 1540330541, 
1540330663, 1540331041, 1540331164, 1540331168, 1540331312, 1540331459, 
1540331465, 1540331579, 1540331603, 1540331630), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), source = c("(direct)", "(direct)", 
"(direct)", "(direct)", "(direct)", "(direct)", "finance", "finance", 
"finance", "google", "google", "google", "google", "google", 
"finance", "finance", "finance", "finance", "finance", "finance", 
"finance", "finance", "finance")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -23L))

Solution

  • Perhaps there is something about your full data structure that invalidates this solution, but here's a candidate:

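    library(dplyr)  # provides arrange() and lag()

    # Order rows so each user's sessions are contiguous and chronological
    df <- arrange(df, userId, timeStamp)
    # Collapse consecutive identical sources into runs
    tmp <- rle(df$source)
    # Replace each "finance" run with the value of the run just before it
    tmp$values[tmp$values == "finance"] <- lag(tmp$values)[tmp$values == "finance"]
    # Expand the runs back to one value per row
    df$source <- inverse.rle(tmp)
    table(df$source)
    # (direct)   google 
    #        9       14 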
    

    The arrange() call makes sure the rows are in the right order. Then, assuming that no user's first source is "finance", the rle() / inverse.rle() steps replace every run of "finance" entries with the source that immediately precedes it.
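If the assumption above might not hold in the full data (a user whose very first source is "finance"), the rle() approach would pick up the last source of the previous user. One possible variant, sketched here under that concern, applies the same run-length logic within each user via group_by(). The small data frame and the helper name replace_finance() are hypothetical and only serve as illustration; they are not part of the original data.

```r
library(dplyr)

# Hypothetical toy data (not the asker's data): user "b" starts with "finance"
df <- data.frame(
  userId    = c("a", "a", "a", "a", "b", "b", "b"),
  timeStamp = as.POSIXct(c(1, 2, 3, 4, 1, 2, 3), origin = "1970-01-01", tz = "UTC"),
  source    = c("google", "finance", "finance", "(direct)",
                "finance", "google", "finance"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each run of "finance" with the value of the
# run just before it; a leading "finance" run has no predecessor and becomes NA
replace_finance <- function(x) {
  runs <- rle(x)
  is_fin <- runs$values == "finance"
  runs$values[is_fin] <- dplyr::lag(runs$values)[is_fin]
  inverse.rle(runs)
}

fixed <- df %>%
  arrange(userId, timeStamp) %>%
  group_by(userId) %>%                        # never look across users
  mutate(source = replace_finance(source)) %>%
  ungroup()

fixed$source
# "google" "google" "google" "(direct)" NA "google" "google"
```

With this variant a user who starts on "finance" ends up with NA rather than another user's source, so such rows are easy to find and handle separately.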