This is the data frame I am using and I am trying to subsample column V2(position) evenly (min:1130, max: 4406748) in a way that there is only one representative of column V4(lineage) in the final sample. I am trying to sample in a way that positions are evenly distributed while ensuring that I include only 1 representative of each group in the entire sample.
I have tried sorting and binning data but I cannot figure out how to evenly sample from it in a way that only 1 representative lineage is present in the data frame.
sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)
# bin the data to N bins
binned_sorted <- sorted_df %>%
mutate(bin = cut(V2, breaks = 150, labels = FALSE))
I would appreciate your help.
This could be approached as an assignment problem, with the cost equal to the distance from an "ideal" distribution of V2
(position) values.
First get the ideal spacing.
r <- range(tb_profiler_barcodes$V2)
n <- length(unique(tb_profiler_barcodes$V4))
ideal <- seq(0.5, n - 0.5)*diff(r)/n + r[1] # ideal "even" spacing
Get the distance between each value in V2
and the ideal sample locations.
d <- outer(tb_profiler_barcodes$V2, ideal, \(x, y) abs(x - y))
For each value in V4
(lineage), get the best candidate for each ideal location (my go-to is data.table
for group operations). This is the row number of the column minimum by lineage.
library(data.table)
idx <- as.matrix(
cbind(data.table(lineage = tb_profiler_barcodes$V4)[,ID := .I], d)[
,lapply(.SD, \(x) ID[which.min(x)]), lineage, .SDcols = 3:(n + 2)
][,lineage := NULL]
)
Get the distance for each row index.
mindists <- idx
mindists[] <- d[cbind(c(idx), c(col(idx)))]
Solve the assignment problem and take the samples.
samples <- tb_profiler_barcodes[
idx[RcppHungarian::HungarianSolver(mindists)$pairs],
]
You can check the distribution with, e.g., plot(sort(samples$V2))
, which will show the points are nearly linear.