Search code examples
rsamplingnormal-distribution

How to sample from skewed data to get a mixed distribution output in r?


I have a dataset of gene lengths that looks a bit like this:

Gene  Length
Gene1  5
Gene2  6
Gene3  400000
Gene4  1000
Gene5  25000
Gene6  10
Gene7  50
Gene8  4
Gene9  100
Gene10 2000

The distribution of the Length is skewed with most genes having a very small size. I'm looking to see if I can make a sample generator that selects a subset of genes that would have differing sizes not just genes that are very small in this range of length data - giving me a list of genes with a mix of lengths in a unbiased/as random a way as possible.

I'm not sure where to start with this. I've been looking into using the sample() function but I haven't been able to get it to sample and consider that I want a subset of genes with mix of gene lengths, not just those in the very small sizes in the range of lengths I have - is what I'm trying to do possible?

Example input data:

df <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5L, 
6L, 400000L, 1000L, 25000L, 10L, 50L, 4L, 100L, 2000L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

Edit:

I am now running this code to sample:

genes_selected <- sample(nrow(df), size = 51, replace=FALSE, 
                            prob=dnorm(df$Gene_length,
                                       mean(df$Gene_length),
                                       sd(df$Gene_length)))

But what I would like is to random select genes that are in the the upper tail of the dnorm()/normal distribution - is this possible to add into prob of sample()?


Solution

  • Perhaps this example can help.

    Let's assume the Genes follow a Weibull distribution :

    classes <- df[order(df$Length)]
    classes$density <- dweibull(1:nrow(df), shape=1, scale=1)
    classes
    
         Gene Length      density
     1:  Gene8      4 3.678794e-01
     2:  Gene1      5 1.353353e-01
     3:  Gene2      6 4.978707e-02
     4:  Gene6     10 1.831564e-02
     5:  Gene7     50 6.737947e-03
     6:  Gene9    100 2.478752e-03
     7:  Gene4   1000 9.118820e-04
     8: Gene10   2000 3.354626e-04
     9:  Gene5  25000 1.234098e-04
    10:  Gene3 400000 4.539993e-05
    

    enter image description here Now let's replicate the classes according to their density:

    dfrep <- classes[rep(1:nrow(classes), classes$density*100000)]
    

    We now have a 'skewed' distribution with short Genes (Gene 8, Gene 1, ...) over represented :

    table(dfrep$Gene)
    
     Gene1 Gene10  Gene2  Gene3  Gene4  Gene5  Gene6  Gene7  Gene8  Gene9 
     13533     33   4978      4     91     12   1831    673  36787    247 
    

    We can get the classes distribution and calculate a density

    classes <- table(dfrep$Gene)
    density_calc <- classes/sum(classes)
    dfrep$density_calc <- density_calc[match(dfrep$Gene,names(density_calc))]
    

    If we want to sample equal number of Genes from each class, we can use the prob argument to sample according to inverse of density_calc :

    table(sample(dfrep$Gene, size=1000, prob=1/dfrep$density_calc, replace=T))
    
     Gene1 Gene10  Gene2  Gene3  Gene4  Gene5  Gene6  Gene7  Gene8  Gene9 
       101     90    111     96     92     99    102    110    101     98 
    

    Which allows to get similar amount of each Gene from the skewed distribution.