Generating meaningful sample data in R based on conditions?


I'm trying to generate some sample insurance claims data that is meaningful instead of just random numbers.

Assuming I have two columns Age and Injury, I need meaningful values for ClaimAmount based on certain conditions:

ClaimantAge | InjuryType | ClaimAmount
---------------------------------------
    35        Bruises
    55        Fractures
    .            .
    .            .
    .            .
  1. I want to generate claim amounts that increase with age and then plateau at around a certain age, say 65.

  2. Claims for certain injuries need to be higher than claims for other types of injuries.

Currently I am generating my samples in a random manner, like so:

amount <- sample(0:100000, 2000, replace = TRUE)  

How do I generate more meaningful samples?


Solution

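  • I don't know the field, so this will probably need adjusting in several ways. Given that we're talking about dollar amounts, I would use the Poisson distribution to generate the data, with a mean (lambda) built from a base amount, an age effect relative to the median age, and an offset for each injury type.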

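    set.seed(1)  # make the simulation reproducible
    n_claims <- 2000
    injuries <- c("bruises", "fractures")
    prob_injuries <- c(0.7, 0.3)  # share of claims for each injury type
    
    sim_claims <- data.frame(claimid = 1:n_claims)
    # Ages drawn from a normal distribution, rounded to whole years
    sim_claims$age <- round(rnorm(n = n_claims, mean = 35, sd = 15), 0)
    sim_claims$Injury <- factor(sample(injuries, size = n_claims, replace = TRUE, prob = prob_injuries))
    # Poisson mean: base of 100, plus 5 per year of age relative to the median age,
    # plus an injury-specific offset (case_when needs the dplyr package installed).
    # pmax() keeps lambda non-negative; a negative lambda makes rpois() return NA.
    sim_claims$Amount <- rpois(n_claims, lambda = pmax(0, 100 + (5 * (sim_claims$age - median(sim_claims$age))) + 
                                 dplyr::case_when(sim_claims$Injury == "bruises" ~ 50,
                                                  sim_claims$Injury == "fractures" ~ 500)))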
    
    head(sim_claims)
    
      claimid age    Injury Amount
    1       1  26   bruises    117
    2       2  38   bruises    175
    3       3  22   bruises    102
    4       4  59   bruises    261
    5       5  40 fractures    644
    6       6  23   bruises     92
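
The simulation above lets the expected amount keep rising with age, while the question asks for a plateau at around 65. A minimal sketch of one way to add that, using only base R: cap the age that feeds into lambda with pmin(). The age floor of 18 and the names plateau_age, min_age, age_effect and injury_effect are assumptions made for illustration, not part of the original answer.

```r
set.seed(1)
n_claims <- 2000
plateau_age <- 65  # assumed age at which amounts stop rising
min_age <- 18      # assumed minimum claimant age (avoids negative ages from rnorm)

claims <- data.frame(
  age = pmax(min_age, round(rnorm(n_claims, mean = 35, sd = 15))),
  Injury = sample(c("bruises", "fractures"), n_claims,
                  replace = TRUE, prob = c(0.7, 0.3))
)

# Age effect: grows by 5 per year up to plateau_age, then stays flat
age_effect <- 5 * (pmin(claims$age, plateau_age) - min_age)

# Injury effect: named lookup vector, a base R alternative to case_when
injury_effect <- unname(c(bruises = 50, fractures = 500)[as.character(claims$Injury)])

# lambda is always at least 150 here, so rpois() never sees a negative mean
lambda <- 100 + age_effect + injury_effect
claims$Amount <- rpois(n_claims, lambda = lambda)

# Mean simulated amount per injury type
aggregate(Amount ~ Injury, data = claims, FUN = mean)
```

Because the cap is applied to the age before it enters lambda, everyone aged 65 or older with the same injury shares the same expected amount. The slope, base amount and offsets are the same illustrative values as above and would need tuning to real claims data.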