I have a list of genes with 1-3 probes for each gene, and an intensity value for each probe. An example is as follows:
GENE_ID Probes Intensity
GENE:JGI_V11_100009 GENE:JGI_V11_1000090102 253.479375
GENE:JGI_V11_100009 GENE:JGI_V11_1000090202 712.235625
GENE:JGI_V11_100036 GENE:JGI_V11_1000360103 449.065625
GENE:JGI_V11_100036 GENE:JGI_V11_1000360203 641.341875
GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.07125
GENE:JGI_V11_100044 GENE:JGI_V11_1000440101 456.133125
GENE:JGI_V11_100045 GENE:JGI_V11_1000450101 369.790625
GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.97375
GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.55125
I want to determine the variance between the probes for each individual gene (so that for every gene I have a variance value).
I am aware that I should use the tapply() function, but I don't know how to accomplish this other than:
tapply( , , var)
You can use data.table or dplyr to accomplish this. This is a classic group_by case:
# dplyr: add the per-gene variance as a new column on every row
library(dplyr)
df %>%
group_by(GENE_ID) %>%
mutate(new_var = var(Intensity))
# data.table: same result; setDT() converts df to a data.table by reference
library(data.table)
setDT(df)
df[, new_var := var(Intensity), .(GENE_ID)]
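Both approaches give the same result (shown here as printed by data.table). Genes with only one probe get NA, because the sample variance of a single value is undefined: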
GENE_ID Probes Intensity new_var
1: GENE:JGI_V11_100009 GENE:JGI_V11_1000090102 253.4794 105228.6
2: GENE:JGI_V11_100009 GENE:JGI_V11_1000090202 712.2356 105228.6
3: GENE:JGI_V11_100036 GENE:JGI_V11_1000360103 449.0656 168802.8
4: GENE:JGI_V11_100036 GENE:JGI_V11_1000360203 641.3419 168802.8
5: GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.0712 168802.8
6: GENE:JGI_V11_100044 GENE:JGI_V11_1000440101 456.1331 NA
7: GENE:JGI_V11_100045 GENE:JGI_V11_1000450101 369.7906 NA
8: GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.9738 6282014.8
9: GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.5513 6282014.8
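Since the question asks specifically about tapply(), here is a minimal base R sketch of how that call could be filled in. It assumes the data frame is called df and has the GENE_ID and Intensity columns shown in the question; the data frame is rebuilt from the example rows only so that the snippet runs on its own. Unlike the mutate() and := versions above, this returns one value per gene rather than one per row.

```r
# Rebuild the example data from the question (only the columns needed here).
df <- data.frame(
  GENE_ID = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100036",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100044",
              "GENE:JGI_V11_100045", "GENE:JGI_V11_100062",
              "GENE:JGI_V11_100062"),
  Intensity = c(253.479375, 712.235625, 449.065625, 641.341875,
                1237.07125, 456.133125, 369.790625, 2839.97375,
                6384.55125),
  stringsAsFactors = FALSE
)

# tapply(values, grouping, function): apply var() to the intensities
# of each gene. The result is a named vector with one entry per gene.
gene_var <- tapply(df$Intensity, df$GENE_ID, var)
print(gene_var)
```

Genes with a single probe (here 100044 and 100045) come back as NA, just as in the output above.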