Search code examples
rdplyrtidyversesampling

How to sample evenly from a data frame having only one representative per group?


This is the data frame I am using and I am trying to subsample column V2(position) evenly (min:1130, max: 4406748) in a way that there is only one representative of column V4(lineage) in the final sample. I am trying to sample in a way that positions are evenly distributed while ensuring that I include only 1 representative of each group in the entire sample.

I have tried sorting and binning data but I cannot figure out how to evenly sample from it in a way that only 1 representative lineage is present in the data frame.

sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)
# bin the data to N bins
binned_sorted <- sorted_df %>%
  mutate(bin = cut(V2, breaks = 150, labels = FALSE)) 

I would appreciate your help.


Solution

  • This could be approached as an assignment problem, with the cost equal to the distance from an "ideal" distribution of V2 (position) values.

    First get the ideal spacing.

    r <- range(tb_profiler_barcodes$V2)
    n <- length(unique(tb_profiler_barcodes$V4))
    ideal <- seq(0.5, n - 0.5)*diff(r)/n + r[1] # ideal "even" spacing
    

    Get the distance between each value in V2 and the ideal sample locations.

    d <- outer(tb_profiler_barcodes$V2, ideal, \(x, y) abs(x - y))
    

    For each value in V4 (lineage), get the best candidate for each ideal location (my go-to is data.table for group operations). This is the row number of the column minimum by lineage.

    library(data.table)
    
    idx <- as.matrix(
      cbind(data.table(lineage = tb_profiler_barcodes$V4)[,ID := .I], d)[
              ,lapply(.SD, \(x) ID[which.min(x)]), lineage, .SDcols = 3:(n + 2)
            ][,lineage := NULL]
    )
    

    Get the distance for each row index.

    mindists <- idx
    mindists[] <- d[cbind(c(idx), c(col(idx)))]
    

    Solve the assignment problem and take the samples.

    samples <- tb_profiler_barcodes[
      idx[RcppHungarian::HungarianSolver(mindists)$pairs],
    ]
    

    You can check the distribution with, e.g., plot(sort(samples$V2)), which will show the points are nearly linear.