
Generating dummy webshop data in R: Incorporating parameters when randomly generating transactions


For a course I am currently taking, I am trying to build a dummy transaction, customer and product dataset to showcase a machine learning use case in a webshop environment, as well as a financial dashboard; unfortunately, we have not been given dummy data. I figured this would be a nice way to improve my R knowledge, but I am having severe difficulty getting it to work.

The idea is that I specify some parameters/rules (arbitrary/fictitious, but applicable for a demonstration of a certain clustering algorithm). I'm basically trying to hide a pattern to then re-find this pattern utilizing machine learning (not part of this question). The pattern I'm hiding is based on the product adoption life cycle, attempting to show how identifying different customer types could be used for targeted marketing purposes.

I'll demonstrate what I'm looking for. I'd like to keep it as realistic as possible. I attempted to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other ways of doing this.

This is how far I have come. First, build a table of customers:

# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseeker")   # Names must match the Parameters table
PropCustTypes <- c(.10, .45, .30, .15)   # Probability of being in each group.

set.seed(1)   # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000), 
  CustomerType = sample(CustomerTypes, size=10000,
                                  replace=TRUE, prob=PropCustTypes),
  NumBought = rnorm(10000,3,2)   # Number of Transactions to Generate, open to alternative solutions?
)
Customers$NumBought[Customers$NumBought < 0] <- 0   # Floor NumBought at 0 (no negative counts)
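
One possible alternative to the truncated normal, as a hedged sketch rather than a recommendation: draw the counts from a Poisson distribution, which only yields non-negative whole numbers. The rate of 3 simply mirrors the mean used above, and `NumBoughtPois` is a hypothetical name.

```r
set.seed(1)
# Poisson draws are whole numbers >= 0, so no rounding or flooring is needed.
NumBoughtPois <- rpois(10000, lambda = 3)
summary(NumBoughtPois)
```

A Poisson count has variance equal to its mean, so it is less spread out than `rnorm(10000, 3, 2)`; whether that matters depends on the pattern you want to hide.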

Next, generate a table of products to choose from:

Products <- data.frame(
  ID=(1:50),
  DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
  SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10   # Floor SuggestedPrice at $10
Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10")   # No release date earlier than 2013-04-10 (1 year ago)

Now I would like to generate n transactions per customer (n is NumBought in the customer table above), based on the following parameters for each variable that is currently relevant.

Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
    stringsAsFactors=FALSE)

Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4    Dealseeker            0.6             0.05          0.35         12     0.10

The idea is that 'EarlyAdopters' would have, on average, 10% of transactions with a label 'BySearchEngine', 60% 'ByDirectCustomer' and 30% 'ByPartnerBlog'. These channels are mutually exclusive: a single transaction cannot be obtained via both a PartnerBlog and a Search Engine in the final dataset. The options are:

ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
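
A minimal sketch of how the mutually exclusive channel could be drawn, assuming one `sample()` call per customer type with that type's row of probabilities. `ChannelProbs` and `drawChannel` are hypothetical names; the numbers are copied from the Parameters table.

```r
set.seed(1)
ObtainedBy <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Channel probabilities per customer type, copied from the Parameters table.
ChannelProbs <- rbind(
  EarlyAdopter  = c(0.10, 0.60, 0.30),
  Pragmatists   = c(0.40, 0.30, 0.30),
  Conservatives = c(0.50, 0.15, 0.35),
  Dealseeker    = c(0.60, 0.05, 0.35)
)
colnames(ChannelProbs) <- ObtainedBy

# Hypothetical helper: draw exactly one channel for each of n transactions.
drawChannel <- function(type, n) {
  sample(ObtainedBy, size = n, replace = TRUE, prob = ChannelProbs[type, ])
}

channels <- drawChannel("EarlyAdopter", 1000)
prop.table(table(channels))   # observed shares, close to 0.10 / 0.60 / 0.30
```

Because each transaction receives a single sampled label, the exclusivity requirement holds by construction.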

Furthermore, I'd like to generate a discount variable that is normally distributed utilizing the above means. For simplicity, standard deviations may be mean/5.
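
As a hedged sketch of that rule: draw from `rnorm()` with the type's mean and `sd = mean / 5`, then floor at 0 so no discount is negative. `MeanDiscount` and `drawDiscount` are hypothetical names; the means come from the Parameters table.

```r
set.seed(1)
# Mean discount per customer type, from the Parameters table.
MeanDiscount <- c(EarlyAdopter = 0, Pragmatists = 0,
                  Conservatives = 0.05, Dealseeker = 0.10)

# Hypothetical helper: normal draws with sd = mean / 5, floored at 0.
drawDiscount <- function(type, n) {
  m <- MeanDiscount[[type]]
  pmax(rnorm(n, mean = m, sd = m / 5), 0)
}

d <- drawDiscount("Dealseeker", 1000)
summary(d)
```

Note that a mean of 0 gives a standard deviation of 0, so EarlyAdopter and Pragmatists always receive exactly 0 discount under this rule.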

Next, the trickiest part: I'd like to generate these transactions according to a few rules:

  • Somewhat evenly distributed over days, maybe slightly more during the weekend;
  • Spread out between 2006-2014.
  • Spreading out the # of transactions of customers over the years;
  • Customers cannot buy products that haven't been released yet.
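
Two of these rules (slightly more weekend traffic, no purchases before release) could be sketched as weighted sampling from a calendar. This is only one possible approach; the weekend weight of 1.3 and the names `allDays`, `dayWeight` and `drawPurchaseDates` are assumptions made for illustration.

```r
set.seed(1)
allDays <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# wday is 0 for Sunday and 6 for Saturday.
isWeekend <- as.POSIXlt(allDays)$wday %in% c(0, 6)
dayWeight <- ifelse(isWeekend, 1.3, 1)   # assumed weekend uplift

# Hypothetical helper: sample n purchase dates on or after a release date.
# Assumes the release date is not later than the last calendar day.
drawPurchaseDates <- function(releaseDate, n) {
  idx <- which(allDays >= releaseDate)
  allDays[idx[sample.int(length(idx), size = n, replace = TRUE,
                         prob = dayWeight[idx])]]
}

purchases <- drawPurchaseDates(as.Date("2013-04-10"), 5)
purchases
```

The Timeliness parameter is not used here; it could be folded in by making the weights depend on the number of days since release.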

Other Parameters:

YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <-  1 # Same question? Likely dependent on YearlyMax

The result for CustomerID 2 might look like this:

Transactions <- data.frame(
    ID        = c(1,2),
    CustomerID = c(2,2), # The customer that bought the item.
    ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
    DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
    ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
    GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
    Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.    

Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00

I'm getting more and more confident in writing R code, but I'm having difficulties writing the code to keep the global parameters (daily distributions of transactions, yearly maximum of # transactions per customer) as well as the various linkages in line:

  • Timeliness: how quickly people purchase after release
  • ReferredBy: how did this customer arrive at my website?
  • Discount: how much discount the customer received (to illustrate how sensitive one is to discounts)

This causes me to not know whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome as well, even though I'm eager to solve this problem by means of R. I'll keep this post updated as I progress.

My current pseudocode:

  • Assign customer to customer type with sample()
  • Generate Customers$NumBought transactions
  • ... Still thinking?

EDIT: Generating the transactions table, now I 'just' need to fill it with the right data:

Tr <- data.frame(
  ID = 1:sum(Customers$NumBought),
  CustomerID = NA,
  DateOfPurchase = NA,
  ReferredBy = NA,
  GrossPrice=NA,
  Discount=NA)
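
On the loop question, one vectorised option is to repeat each customer ID once per transaction with `rep()`, so no `for` loop over customers is needed. A minimal sketch on a small stand-in table (`Cust` and `TrSketch` are hypothetical names, and the counts are assumed to be whole numbers):

```r
# Small stand-in for the Customers table; counts must be whole numbers.
Cust <- data.frame(ID = 1:5, NumBought = c(2, 0, 3, 1, 0))

# One row per transaction: each customer ID repeated NumBought times.
TrSketch <- data.frame(CustomerID = rep(Cust$ID, times = Cust$NumBought))
TrSketch$ID <- seq_len(nrow(TrSketch))
TrSketch
```

Customers with a count of 0 simply contribute no rows, and the remaining columns can then be filled per row from the customer's type.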

Solution

  • Very roughly, set up a database of days and the number of visits on each day:

    days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000)  # expected visits per day
    # you could change the customerRate to reflect promotions, time since launch, ...
    days$nVisits <- rpois(8000, days$customerRate)
    

    Then catalogue the visits:

        visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
        visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
        visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])
    

    Any of the variables with X in front of them are parameters of your process. You'd similarly go on to generate a transactions database by parametrising the relative likelihood amongst objects available, according to the other columns you have. Or you can generate a visits database including a key to each product available at that day:

       productRelease <- data.frame(id=X, releaseDay=sort(X)) # ie df is sorted by releaseDay
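       visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
       visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
       # number of products already released on each day (rows 1..k of productRelease, as it is sorted)
       days$productsAvailable <- findInterval(days$day, productRelease$releaseDay)
       # one row per (visit, available product); visits on days with no products drop out
       visits <- visits[rep(seq_len(nrow(visits)), times=days$productsAvailable[visits$day]),]
       visits$prodID <- ave(visits$id, visits$id, FUN=seq_along)  # row index into productRelease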
    

    You can then define a function that gives you, for each row, the probability of the customer purchasing that item (based on day, customer, product), and then fill in the purchase with `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.

    Sorry, there are probably typos littered throughout this, as I was typing it straight in, but hopefully this gives you an idea.
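
For illustration, the days, visits and purchase steps can be put together as a small runnable sketch. Every concrete number below is an assumption standing in for one of the X parameters (a shortened 100-day horizon, 2000 expected visits, made-up purchase probabilities), and the product dimension is left out for brevity.

```r
set.seed(1)
nDays <- 100          # assumed horizon, shortened for the example
totalVisits <- 2000   # assumed stand-in for XtotalNumberOfVisits

# Expected visits per day, then the realised Poisson count for each day.
days <- data.frame(day = 1:nDays, customerRate = totalVisits / nDays)
days$nVisits <- rpois(nDays, days$customerRate)

# One row per visit, tagged with its day and a customer type (1 to 4).
visits <- data.frame(id  = seq_len(sum(days$nVisits)),
                     day = rep(days$day, times = days$nVisits))
typeWeights <- c(.10, .45, .30, .15)   # the question's PropCustTypes
visits$customerType <- sample(4, nrow(visits), replace = TRUE,
                              prob = typeWeights)

# Assumed purchase probability: differs by type and decays slowly over time.
baseProb <- c(0.30, 0.20, 0.10, 0.15)
purchaseProb <- baseProb[visits$customerType] * exp(-visits$day / 200)
visits$didTheyPurchase <- runif(nrow(visits)) < purchaseProb

table(visits$customerType, visits$didTheyPurchase)
```

Swapping `purchaseProb` for a function of day, customer type and product is where the hidden pattern would be encoded.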