I'm trying to assign NA values to a column.
Data:
df <- data.frame(V1 = c(0, 0, 0, 1, 0, 1, 1, 1, 1, 0),
                 V2 = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0),
                 V3 = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 1))
df
How can I set 2 of the 10 values in 'V1' to NA (10 * 0.20 = 2)? Doing this once produces one data set. I want to repeat the process 10 times, and the data sets generated by the replications must all differ from each other. In the end, I will have 10 data sets to export.
A scalable approach which should be relatively efficient:
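n.rep <- 10
n.na <- 2
# stores NA indices which have been used already
na.ind.used <- character(n.rep)
lapply(seq_len(n.rep), \(i) {
  repeat {
    # candidate indices
    na.ind <- sample.int(nrow(df), n.na)
    # string representation of indices for comparison
    na.ind.str <- paste(sort(na.ind), collapse = ' ')
    # accept the candidate only if this set of indices is new
    if (!(na.ind.str %in% na.ind.used))
      break
  }
  # remember the accepted indices (modifies the variable outside the function)
  na.ind.used[[i]] <<- na.ind.str
  # df is modified only inside the function; the original stays intact
  df$V1[na.ind] <- NA
  df
})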
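The question also mentions exporting the data sets. Here is a minimal sketch of one way to do that, assuming the list returned by lapply() has been stored in a variable (called `res` here, a name not used above) and that CSV files are an acceptable format. The two-element `res`, the temporary directory and the file name pattern are placeholders for illustration only.

```r
df <- data.frame(V1 = c(0, 0, 0, 1, 0, 1, 1, 1, 1, 0),
                 V2 = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0),
                 V3 = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 1))

# Stand-in for the list of data sets produced by lapply() above
# (hypothetical: only two data sets with hand-picked NA rows)
res <- lapply(list(c(2, 7), c(1, 5)), function(i) {
  df$V1[i] <- NA
  df
})

# One CSV file per data set; tempdir() and the file names are assumptions
out.dir <- tempdir()
files <- file.path(out.dir, sprintf("df_%02d.csv", seq_along(res)))
for (j in seq_along(res)) {
  write.csv(res[[j]], files[j], row.names = FALSE)
}
```

write.csv() writes missing values as the string NA by default, so read.csv() should read them back as missing.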
In theory, we could also enumerate all possible combinations, randomly select some of their indices, and generate the corresponding combinations directly from these indices, without any testing for duplicates. With large data, however, the number of combinations may exceed what an integer can represent, so we would need a larger data type. At the moment I don't think it's worth the hassle.
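For small problems, the enumeration idea can be sketched as follows. This is only an illustration under stated assumptions: `unrank_comb` is a hypothetical helper (not a base R function) that maps a 0-based rank to the combination at that position in lexicographic order, and the sketch assumes that choose(n, k) is small enough to be represented exactly as a double (below 2^53), which is the limitation described above.

```r
df <- data.frame(V1 = c(0, 0, 0, 1, 0, 1, 1, 1, 1, 0),
                 V2 = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0),
                 V3 = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 1))

# Hypothetical helper: returns the combination of k elements out of 1..n
# that has 0-based rank r in lexicographic order
unrank_comb <- function(r, n, k) {
  comb <- integer(k)
  x <- 1L
  for (i in seq_len(k)) {
    repeat {
      # number of combinations that have x at position i
      cnt <- choose(n - x, k - i)
      if (r < cnt) break
      # skip all of them and try the next candidate value
      r <- r - cnt
      x <- x + 1L
    }
    comb[i] <- x
    x <- x + 1L
  }
  comb
}

set.seed(0)
n.rep <- 10
n.na <- 2
# distinct ranks give distinct combinations, so no duplicate test is needed
ranks <- sample.int(choose(nrow(df), n.na), n.rep) - 1
datasets <- lapply(ranks, function(r) {
  df$V1[unrank_comb(r, nrow(df), n.na)] <- NA
  df
})
```

Because sample.int() draws without replacement by default, the ranks are distinct and so are the resulting sets of NA positions. The full matrix of combinations is never built.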
Original approach, which doesn't scale well. With larger data (1000 rows and 10% or more missing values, which gives 6.385051e+139 or more combinations), this approach is not feasible:
To do this effectively, I would use combn to generate all possible pairs of indices for the two NA values and then randomly choose 10 of these:
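set.seed(0)
# all possible pairs of row indices, one pair per column
ind <- combn(nrow(df), 2)
# randomly keep 10 distinct pairs
ind <- ind[, sample.int(ncol(ind), 10)]
# one data set per pair; simplify = FALSE keeps the result as a list
apply(ind, 2, \(i) {
  df$V1[i] <- NA
  df
}, simplify = FALSE)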
Output:
[[1]]
V1 V2 V3
1 0 0 0
2 NA 0 1
3 0 0 1
4 1 1 0
5 0 1 0
6 1 1 0
7 NA 1 1
8 1 1 1
9 1 1 0
10 0 0 1
[[2]]
V1 V2 V3
1 NA 0 0
2 0 0 1
3 0 0 1
4 1 1 0
5 NA 1 0
6 1 1 0
7 1 1 1
8 1 1 1
9 1 1 0
10 0 0 1
[[3]]
V1 V2 V3
1 0 0 0
2 0 0 1
3 0 0 1
4 1 1 0
5 0 1 0
6 NA 1 0
7 1 1 1
8 1 1 1
9 1 1 0
10 NA 0 1
[[4]]
V1 V2 V3
1 NA 0 0
2 NA 0 1
3 0 0 1
4 1 1 0
5 0 1 0
6 1 1 0
7 1 1 1
8 1 1 1
9 1 1 0
10 0 0 1
[[5]]
V1 V2 V3
1 0 0 0
2 0 0 1
3 0 0 1
4 1 1 0
5 NA 1 0
6 1 1 0
7 1 1 1
8 1 1 1
9 NA 1 0
10 0 0 1
[[6]]
V1 V2 V3
1 0 0 0
2 0 0 1
3 NA 0 1
4 1 1 0
5 0 1 0
6 1 1 0
7 1 1 1
8 1 1 1
9 NA 1 0
10 0 0 1
[[7]]
V1 V2 V3
1 0 0 0
2 0 0 1
3 0 0 1
4 1 1 0
5 0 1 0
6 1 1 0
7 1 1 1
8 1 1 1
9 NA 1 0
10 NA 0 1
[[8]]
V1 V2 V3
1 0 0 0
2 0 0 1
3 NA 0 1
4 NA 1 0
5 0 1 0
6 1 1 0
7 1 1 1
8 1 1 1
9 1 1 0
10 0 0 1
[[9]]
V1 V2 V3
1 0 0 0
2 0 0 1
3 0 0 1
4 1 1 0
5 NA 1 0
6 1 1 0
7 1 1 1
8 NA 1 1
9 1 1 0
10 0 0 1
[[10]]
V1 V2 V3
1 0 0 0
2 0 0 1
3 NA 0 1
4 1 1 0
5 0 1 0
6 1 1 0
7 NA 1 1
8 1 1 1
9 1 1 0
10 0 0 1