Search code examples
rdplyrdata.tablegroupingcluster-analysis

Get percentage of remaining users after a specified date in R


I am working with a user generated data set (say it's app user data or service), and I cluster it based on user behaviour characteristics i.e. frequency of use. I would like to see how many, or what percentage of users stop using the app/service after a specific date and from what cluster they come from.

Here is an reproducible example which I hope is appropriate:-

library(Pareto)
library(uuid)
library(ggplot2)
library(tidyverse)
library(data.table)
set.seed(1)

n_users <- 100
n_rows <- 3650

relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500) 
unique_ids <- UUIDgenerate(n = n_users)

id_sample <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
Date<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date<-sample(Date,size = n_rows,replace = T)

df<-data.frame(id_sample,Date)
df
df<-df%>%arrange(Date)


userData<-df%>%
  group_by(id_sample)%>%
  summarise(Count=n())

scaledData<-scale(userData[,2])
scaledData

set.seed(15)
clust<-kmeans(scaledData, centers=5, nstart = 15)

userData$Cluster<-clust$cluster

setDT(userData)[Cluster==1, ClusterName:="Cluster 1"]
userData[Cluster==2, ClusterName:="Cluster 2"]
userData[Cluster==3, ClusterName:="Cluster 3"]
userData[Cluster==4, ClusterName:="Cluster 4"]
userData[Cluster==5, ClusterName:="Cluster 5"]

user_vars<-userData%>%select(id_sample,ClusterName)

df<-merge(df,user_vars,by="id_sample")

df$Month<-lubridate::month(df$Date)

df%>%
  group_by(Month)%>%
  summarise(N_Users=n_distinct(id_sample))

I am wondering if there was a dplyr solution or something similar where I could set a Date as a threshold to see how many Users (as a percentage or count) that were in the data before the specified date, appear after the specified date and at cluster level. For instance, a solution that captures all the unique user ID's prior to the specified date, with the outcome showing what percentage of these users are still left in the data after the specified date (grouped by cluster level).


Solution

  • library(lubridate)
    how_many=function(df, cluster, my_date) {
      df1=df%>%filter(ClusterName==cluster)
      before=filter(df1, Date<my_date) 
      after=filter(df1,Date>my_date)
      count=0
      for (i in unique(before$id_sample)) {
        if (i %in% after$id_sample) {
          count=count+1
        }
      }
      return(c(count, count/n_distinct(before$id_sample)))
    }
    

    This function takes the data frame (df), the cluster name, and the date that you want to see the proportion/number of unique users before the date appeared at least once after the date.

    how_many(df, "Cluster 4", make_date(2015, 05, 19))
    
    39.0000000  0.9069767