So in the biotech company I work for, we scan for small biomarkers that are present in various sample types from experiments being performed by research institutions. For example, some research institution/department is studying what biomarkers may be present in certain individuals of some population of interest. Perhaps over time, they do this kind of a study three times (not necessarily with the same individuals, though some of the same individuals may be present in the future studies -- this is case by case dependent), and then they want to merge the data into one single, finalized data frame. Sounds easy enough, right? Well, it can be a bit more complicated than just stacking the data sets on top of each other.
As you might imagine, sometimes a biomarker shows up in some individuals and not in others. Furthermore, sometimes, a biomarker shows up solely in one of the sampled populations and not any of the other ones sampled. I'm working on an internal R&D project for my company, and am trying to simulate data that we might get from three separate experiments (which are stored in the three "toy" data frames I've created in the code provided below). If you run what I've created below, it will result in one "all" data frame at the end that consists of 30 observations from 30 (fake) individuals, where each biomarker is a column labeled "x1", "x2", etc. Again, the point here is to try and simulate real data for an internal research project I'm working on. I've tried to simulate the fact that sometimes, a biomarker is present in one set and not all the others. This is why the column names aren't all the same and some have names that aren't present in the others.
# bringning in the best data management package there is!
library(dplyr)
# making a couple toy data frames used to try and create dummy examples of all the merged files
set.seed(42)
toy_df1 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df2 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df3 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
names(toy_df1) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
names(toy_df2) <- c("x1", "x2", "x3", "x5", "x6", "x7", "x8", "x9", "x10", "x11")
names(toy_df3) <- c("x1", "x3", "x4", "x5", "x7", "x8", "x9", "x10", "x11", "x13")
# adding a dummy SSID to each toy dataframe
toy_df1$SSID <- as.numeric(rep(24001, nrow(toy_df1))) # Sample set ID from the first study
toy_df2$SSID <- as.numeric(rep(24002, nrow(toy_df2))) # Sample set ID from the second study
toy_df3$SSID <- as.numeric(rep(24003, nrow(toy_df3))) # Sample set ID from the third study
# inserting each dummy dataframe into a list object for storage (might be useful later)
toy_data_list <- list(toy_df1, toy_df2, toy_df3)
# merging the toy data sheets into the "Data All"-esque file; this takes each dataframe and stacks them
# and stacks them on top of each other in sequential order of the SSIDs.
toy_data_all <- bind_rows(toy_df1, toy_df2, toy_df3)
My issue is this: I'm having trouble trying to simulate the randomness of the NA occurrences within each "toy" dataframe. For instance, in toy_df_1
, perhaps biomarker x1
was observed in all but the first, third, and seventh individuals. Also still in toy_df_1
, perhaps x3
was only observed in only the even numbered observations, so observations 1, 3, 5, 7, and 9, have NA
values. Etc.
I don't know how to best simulate this random data in each data frame, per each biomarker, and this is what I'm seeking help/input on here. I have provided below the currently (very rough) code ideas I'm working with. My current idea/thought process is this: I want to loop through each toy data frame stored in this toy dataframe list, which I have named toy_data_list
, and then loop through each column within each of these toy data frames and take a random sample of the observations, and replace them with NA
. This is the part I am having trouble with the most I think. I don't want the randomly selected amount of observations to always be 3 or 4. Ideally, sometimes it'd be all 10, 0, 1, 5, 3, 2, etc., see what I mean? The below code is my best attempt at doing this but I'm running into issues with that sample()
function . If anybody has any better ideas or ways of doing this I'm all ears. Thank you!
# adding NA values to the original toy dataframes to simulate real data
amount_sampled <- c(0:10)
for (i in 1:length(df_list)){
for (j in 1:nrow(df_list[[i]])){
df_list[[i]][,j][sample(df_list[[i]][,j], size = sample(amount_sampled, 1))] <- NA
}
}
Here's one possibility, using the tidyverse, as you've already referenced the dplyr package.
lapply(
toy_data_list,
function(x) {
x %>% mutate(
across(
starts_with("x"),
function(.x, probMiss) {
ifelse(runif(nrow(.)) < probMiss, NA, .x)
},
probMiss=0.1
)
)
}
)
[[1]]
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 SSID
1 0.3126459 -0.22142515 1.46026211 -1.62140875 -0.38502931 -0.6746660 0.1067005 0.2994059 -0.9105534 -0.7903800 24001
2 1.1208476 -0.96215911 2.00945844 -0.06857533 1.37178507 -0.7397778 -0.5422158 -0.4377205 0.1139766 0.5933915 24001
3 0.4718309 0.18746019 -0.07793822 -0.06298292 0.15898320 1.1972270 -1.4657003 -0.4179386 -1.0482775 -0.1309050 24001
4 1.1276853 0.01224463 1.35434208 1.07570771 -0.20485705 0.3777403 -0.1340144 NA 0.2833195 1.4239725 24001
5 0.6777231 0.71228374 -1.02718379 1.12910075 -0.85209216 0.2747821 1.2911999 0.4103885 0.1970954 NA 24001
6 NA -0.75372581 -1.04117532 0.45014673 -0.80947218 -0.9039600 -0.4470329 1.2823981 NA -0.1676203 24001
7 2.0610343 0.03818993 -0.87062302 0.05058644 0.53339896 0.4048332 -0.6301996 1.4470047 0.9986759 -1.3834170 24001
8 -0.6503159 1.17842106 0.85313022 -0.12433022 0.06873847 -0.2391438 -0.5809780 3.1036738 -0.1078761 -0.2894947 24001
9 -1.3242377 -0.15320385 -0.62276019 0.91201146 -0.73678148 1.0743194 -2.3201595 0.5837672 2.2087385 -0.9970005 24001
10 0.6594713 -0.16169753 1.21570987 0.20997680 -0.46916210 -0.3147506 -0.6459044 -0.2545514 -1.6828389 -0.3871874 24001
[[2]]
x1 x2 x3 x5 x6 x7 x8 x9 x10 x11 SSID
1 -0.80952120 -0.852213775 -0.10225035 NA 0.35699328 -1.023753807 -1.4154533 0.65442205 -0.6537130 0.8973321 24002
2 -0.86600469 0.899019904 -0.05125182 -1.4588262 -0.94233711 -0.782264488 0.5288418 0.18005542 1.4591325 -2.2149217 24002
3 0.70240734 -0.007170461 NA 1.7886008 -1.15256861 -2.689387176 -2.4385595 0.90978210 -1.5683487 0.4901286 24002
4 -0.16801229 -0.815786121 NA -0.1338524 -0.38385549 -0.329542362 NA NA 0.5224013 0.5429560 24002
5 -0.75408520 0.421949205 NA 2.0317077 0.01031431 0.009952132 0.1447827 0.73121298 1.1986778 0.1234124 24002
6 -0.05423047 0.563885485 1.54842553 0.3135906 -1.04593866 -1.908321655 0.7276715 -0.06894155 1.2526561 0.1046509 24002
7 -0.63881539 1.291600125 0.45648495 0.1310152 -0.70227040 0.666267145 -1.9790600 -0.03900499 -0.2058987 0.4406734 24002
8 -0.15457321 -0.522821584 -1.13708264 -0.9432088 -0.29729621 0.366913581 -0.3035615 NA 0.4077156 0.2557932 24002
9 1.57088231 1.506271800 0.64951476 2.9289056 0.64291036 -0.825838160 NA -1.03556291 -0.7573311 -0.5181969 24002
10 1.20115726 0.552862461 -0.21124735 1.1622098 NA 1.932775511 0.7907731 -1.45997079 -1.5325551 0.1591070 24002
[[3]]
x1 x3 x4 x5 x7 x8 x9 x10 x11 x13 SSID
1 0.36152389 -1.25924885 -0.95076857 0.06161209 -0.54000591 -0.1575174 -0.2598904 -1.7000069 -1.2643910 0.7219130 24003
2 -2.37950462 -0.09150055 0.77677357 NA -0.83614442 -0.3279649 1.2053670 NA 1.8507856 -0.1508595 24003
3 0.75848553 -1.32246030 1.11150668 -0.73996106 NA -1.0307599 -0.3888235 -0.4143549 0.2517389 NA 24003
4 0.61761461 1.09019804 -0.05473478 0.12734300 0.07943265 0.6863735 -0.5544003 1.7183085 -1.3484155 -1.5051942 24003
5 NA -0.40610207 -0.62706647 NA -1.34442562 0.8155673 0.4449365 -0.5956177 0.9458541 -0.9271672 24003
6 -0.03330626 -0.12127479 -1.00838960 NA 1.61370473 1.5947937 -1.6966612 0.1043491 -0.8805581 -0.2534976 24003
7 NA 0.06427669 NA -1.34485334 0.67379844 -1.3158128 -0.7097424 1.1209104 0.5570678 -0.8930275 24003
8 -1.33089411 -2.25094930 0.54535480 NA 1.42531584 0.7146470 -0.5501727 -0.6699045 0.1830541 NA 24003
9 0.56486217 0.63718879 3.40638747 0.57378460 -0.35473065 0.6521152 -0.1936541 -0.9496021 -0.9153089 1.0442554 24003
10 -1.40741364 -0.47946600 0.54382789 0.71474178 NA 0.3009699 0.5828164 -0.4312363 -1.8665087 NA 24003
probMiss
is the desired probability that any one sample is missing.
The inner function,
function(.x, probMiss) {
ifelse(runif(nrow(.)) < probMiss, NA, .x)
},
generates an (unnamed) vector of U(0,1) random variables and then replaces the existing value in the current column with NA
if the corresponding U(0, 1) variate is less than probMiss
.
across(starting_with("x"), ...
applies this function to all columns of the current data frame whose names start with "x".
lapply(toy_data_list, ...
applies this function to each data frame in the toy_data_list
.