I have a dataset of gene lengths that looks a bit like this:
Gene Length
Gene1 5
Gene2 6
Gene3 400000
Gene4 1000
Gene5 25000
Gene6 10
Gene7 50
Gene8 4
Gene9 100
Gene10 2000
The distribution of the Length
is skewed with most genes having a very small size. I'm looking to see if I can make a sample generator that selects a subset of genes that would have differing sizes not just genes that are very small in this range of length data - giving me a list of genes with a mix of lengths in a unbiased/as random a way as possible.
I'm not sure where to start with this. I've been looking into using the sample()
function but I haven't been able to get it to sample and consider that I want a subset of genes with mix of gene lengths, not just those in the very small sizes in the range of lengths I have - is what I'm trying to do possible?
Example input data:
df <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5",
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5L,
6L, 400000L, 1000L, 25000L, 10L, 50L, 4L, 100L, 2000L)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
Edit:
I am now running this code to sample:
genes_selected <- sample(nrow(df), size = 51, replace=FALSE,
prob=dnorm(df$Gene_length,
mean(df$Gene_length),
sd(df$Gene_length)))
But what I would like is to random select genes that are in the the upper tail of the dnorm()
/normal distribution - is this possible to add into prob
of sample()
?
Perhaps this example can help.
Let's assume the Genes follow a Weibull distribution :
classes <- df[order(df$Length)]
classes$density <- dweibull(1:nrow(df), shape=1, scale=1)
classes
Gene Length density
1: Gene8 4 3.678794e-01
2: Gene1 5 1.353353e-01
3: Gene2 6 4.978707e-02
4: Gene6 10 1.831564e-02
5: Gene7 50 6.737947e-03
6: Gene9 100 2.478752e-03
7: Gene4 1000 9.118820e-04
8: Gene10 2000 3.354626e-04
9: Gene5 25000 1.234098e-04
10: Gene3 400000 4.539993e-05
Now let's replicate the classes according to their density:
dfrep <- classes[rep(1:nrow(classes), classes$density*100000)]
We now have a 'skewed' distribution with short Genes (Gene 8, Gene 1, ...) over represented :
table(dfrep$Gene)
Gene1 Gene10 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 Gene9
13533 33 4978 4 91 12 1831 673 36787 247
We can get the classes distribution and calculate a density
classes <- table(dfrep$Gene)
density_calc <- classes/sum(classes)
dfrep$density_calc <- density_calc[match(dfrep$Gene,names(density_calc))]
If we want to sample
equal number of Genes from each class, we can use the prob
argument to sample according to inverse of density_calc
:
table(sample(dfrep$Gene, size=1000, prob=1/dfrep$density_calc, replace=T))
Gene1 Gene10 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 Gene9
101 90 111 96 92 99 102 110 101 98
Which allows to get similar amount of each Gene from the skewed distribution.