I am very new to programming, so I apologize if my question seems too fundamental.
Basically, I have a data set of approx. 300 rows. The idea is to create an entirely new data set of, for instance, 10k rows that still has the same characteristics as the small data set of 300.
ID Category1 Category2 Amount1 Probability1
1 Class1 A 100 0.3
2 Class2 B 800 0.2
3 Class3 C 300 0.7
4 Class2 A 250 0.4
5 Class3 C 900 0.6
I already did exploratory analysis. I know that my numeric data has a beta distribution, and I know the mean and sd (and the level of skewness, in case it is relevant). For my categorical data I know the percent distribution: for instance, category A takes 25% of the data set, category B takes 35%, and category C takes 40%.
My question now is: what are the best packages in order to simulate this data and to create a bigger data set?
I found the simstudy package, which seemed very good. However, I am still very new to programming and I'm having a hard time getting my head around the code.
Here is the link to the description https://cran.r-project.org/web/packages/simstudy/vignettes/simstudy.html (I also checked the R documentation but for a newbie like me it is very hard to follow and fully understand it)
I still don't really get how I can define my categorical values there. (They set the percent distribution of the individual classes, but they don't actually set which value applies to which class.)
Maybe someone here could help explain how I could apply it to my data set, or is there another, better package for that?
Thank you very much in advance!
EDIT
So my current code with the simstudy package is the following:
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
However, my problem here is that I can't create a skewed beta distribution (and I know that my data is skewed to the right).
Alternatively, I could use this formula, but here I have to create each column separately and I cannot create a relationship between some columns (e.g. correlation, which I would have to create later on as well)
rsbeta(n, shape1, shape2)
# shape1 < shape2 creates a right-skewed beta distribution
rsbeta(1000, 0.2,3)
Any other suggestions on how to resolve this problem?
How do you usually simulate larger data sets from data that has only a limited number of entries?
I have actually done something exactly like this. I calculate the actual min and max for each variable so that the simulation mimics my original dataset. Using simstudy has several advantages over just using sample, primarily that sample only draws from the existing data, while simstudy generates any potential value between the minimum and maximum (for numeric types), or draws according to a proportion for the categorical variables. simstudy is also useful if your original data is sensitive/personal, since you can bypass the privacy problems that come with using sample. This is what I did:
library(skimr)
library(simstudy)
library(dplyr)
library(glue)
sim_definitions <-
  skim_to_wide(iris) %>%
  mutate(min = as.numeric(p0), max = as.numeric(p100)) %>%
  transmute(
    varname = variable,
    dist = case_when(
      # Binary data: only two unique values (e.g. 0 and 1)
      n_unique == 2 ~ "binary",
      n_unique > 2 ~ "categorical",
      TRUE ~ "uniform"
    ),
    formula = case_when(
      dist == "uniform" ~ as.character(glue("{min};{max}")),
      # Only for factors with 3 levels. Each number is a proportion: 0.3 = 30%
      dist == "categorical" ~ "0.5;0.2;0.3",
      dist == "binary" ~ "0.2",
      # Otherwise 10 is the min and 20 is the max
      TRUE ~ "10;20"
    ),
    link = case_when(
      dist == "binary" ~ "logit",
      TRUE ~ "identity"
    )
  )
# 1000 is the final size of the dataset. Change it to whatever you want.
simulated_data <- genData(1000, sim_definitions)
dim(simulated_data)
head(simulated_data)
NOTE: I seem to get an error with simstudy. I'm not sure if it's because of an update. Let me know if this works for you. UPDATE: It seems the categorical specification causes the error, but I was unable to find the problem.
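On the question of which value applies to which class: a categorical definition in simstudy gives integer codes (1, 2, 3, ...) in the order of the proportions, so the labels have to be attached afterwards. Here is a minimal base R sketch of that labelling step. The codes are drawn with sample() only as a stand-in for the simstudy column, and the names labels, probs, codes and category are made up for illustration:

```r
set.seed(42)

n      <- 10000
labels <- c("A", "B", "C")        # class names, in the same order as the proportions
probs  <- c(0.25, 0.35, 0.40)     # proportions from the question

# Stand-in for a simulated categorical column: integer codes 1..3
codes <- sample(seq_along(labels), size = n, replace = TRUE, prob = probs)

# Attach the labels: code 1 becomes "A", code 2 becomes "B", code 3 becomes "C"
category <- factor(codes, levels = seq_along(labels), labels = labels)

# Check that the simulated proportions are close to the requested ones
round(prop.table(table(category)), 2)
```

The same factor() call can be applied to a column of a simulated data set, as long as the labels are listed in the order used in the formula string.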
UPDATE based on clarification in question and comments:
Your code works fine in generating a simulated dataset. If you want to force a skewed distribution, you can use base R's distribution functions like qlnorm. So:
library(simstudy)
#> Loading required package: data.table
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
simulated_data <- genData(1000, def)
hist(simulated_data$Amount2)
simulated_data$Amount2 <- qlnorm(simulated_data$Amount2)
hist(simulated_data$Amount2)
Created on 2019-03-24 by the reprex package (v0.2.1)
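If you already know the mean and sd of a beta-distributed variable, another option is to convert them into the two shape parameters by the method of moments and draw directly with base R's rbeta. This is only a sketch: beta_shapes is a hypothetical helper name, and the mean of 0.3 and sd of 0.15 are made-up example values, not numbers taken from your data. The result is right-skewed whenever shape1 < shape2, which happens when the mean is below 0.5.

```r
# Method of moments: turn a mean and sd on (0, 1) into beta shape parameters
beta_shapes <- function(m, s) {
  v <- s^2
  # A beta distribution requires 0 < mean < 1 and variance < mean * (1 - mean)
  stopifnot(m > 0, m < 1, v > 0, v < m * (1 - m))
  k <- m * (1 - m) / v - 1
  c(shape1 = m * k, shape2 = (1 - m) * k)
}

set.seed(1)
shapes <- beta_shapes(m = 0.3, s = 0.15)   # example values only
x <- rbeta(10000, shape1 = shapes[["shape1"]], shape2 = shapes[["shape2"]])

shapes               # shape1 < shape2, so the distribution is skewed to the right
c(mean(x), sd(x))    # should be close to the requested mean and sd
hist(x)
```

Values on another scale, such as an amount between a known minimum and maximum, can be obtained by rescaling afterwards: min + x * (max - min).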