I am an R beginner. I am using R to analyse a large next-generation sequencing VCF file and am having some difficulties. I have imported the file as a data frame (2446824 obs. of 177 variables) and made a subset with just the 3 samples I am interested in (2446824 obs. of 29 variables).
I now wish to reduce the dimensions even further (cut the rows down to around 200000). I have been trying to use grep, but cannot get it to work. The error I get is
Error in "0/1" | "1/0" :
operations are possible only for numeric, logical or complex types
Here is a small example part of the file I am working with.
Chr Start End Ref Alt Func.refGene INFO FORMAT Run.Sample1 Run.Sample2 Run.Sample3
489 1 909221 909221 T C PASS GT:AD:DP:GQ:PL 0/1:11,0:11:33:0,33,381 ./. ./.
490 1 909238 909238 G C PASS GT:AD:DP:GQ:PL 0/1:11,6:17:99:171,0,274 0/1:6,5:11:99:159,0,116 1/1:0,15:15:36:441,36,0
491 1 909242 909242 A G PASS GT:AD:DP:GQ:PL 0/1:16,4:13:45:0,45,532 0/0:11,0:11:30:0,30,366 0/0:16,0:17:39:0,39,479
492 1 909309 909309 T C PASS GT:AD:DP:GQ:PL 0/0:23,0:23:54:0,54,700 0/0:15,1:16:36:0,36,463 0/0:19,0:19:48:0,48,598
There are two different ways in which I would like to reduce the rows in this dataset:
Code 1. If any of $Run.Sample1, $Run.Sample2 or $Run.Sample3 contains “0/1”, “1/0” or “1/1”, keep the entire row
Code 2. If $Run.Sample1 or $Run.Sample2 contains “0/1”, “1/0” or “1/1” and $Run.Sample3 contains “0/0”, keep the entire row
The results I would want to get from code 1 are:
Chr Start End Ref Alt Func.refGene INFO FORMAT Run.Sample1 Run.Sample2 Run.Sample3
489 1 909221 909221 T C PASS GT:AD:DP:GQ:PL 0/1:11,0:11:33:0,33,381 ./. ./.
490 1 909238 909238 G C PASS GT:AD:DP:GQ:PL 0/1:11,6:17:99:171,0,274 0/1:6,5:11:99:159,0,116 1/1:0,15:15:36:441,36,0
491 1 909242 909242 A G PASS GT:AD:DP:GQ:PL 0/1:16,4:13:45:0,45,532 0/0:11,0:11:30:0,30,366 0/0:16,0:17:39:0,39,479
The results I would want to get from code 2 are:
Chr Start End Ref Alt Func.refGene INFO FORMAT Run.Sample1 Run.Sample2 Run.Sample3
489 1 909221 909221 T C PASS GT:AD:DP:GQ:PL 0/1:11,0:11:33:0,33,381 ./. ./.
491 1 909242 909242 A G PASS GT:AD:DP:GQ:PL 0/1:16,4:13:45:0,45,532 0/0:11,0:11:30:0,30,366 0/0:16,0:17:39:0,39,479
Many thanks for your help
Kelly
Try this. For the first case:
dat[Reduce(`|`,lapply(dat[9:11], function(x) grepl("0/1|1/0|1/1", x))),]
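The error in the question most likely comes from writing the alternatives as separate strings joined with R's `|` operator (`"0/1" | "1/0"`), which only works on logical or numeric values. Inside a regular expression, the `|` belongs within a single pattern string. Here is a minimal sketch on a toy data frame. The reduced three-column layout is an assumption for illustration, so the sample columns are selected by name rather than by position 9:11, and `pattern`, `samples`, `hits`, `keep` and `res1` are hypothetical names.

```r
# Toy data modelled on the four example rows; only the sample columns are kept.
dat <- data.frame(
  Run.Sample1 = c("0/1:11,0:11:33:0,33,381", "0/1:11,6:17:99:171,0,274",
                  "0/1:16,4:13:45:0,45,532", "0/0:23,0:23:54:0,54,700"),
  Run.Sample2 = c("./.", "0/1:6,5:11:99:159,0,116",
                  "0/0:11,0:11:30:0,30,366", "0/0:15,1:16:36:0,36,463"),
  Run.Sample3 = c("./.", "1/1:0,15:15:36:441,36,0",
                  "0/0:16,0:17:39:0,39,479", "0/0:19,0:19:48:0,48,598"),
  row.names = c("489", "490", "491", "492"),
  stringsAsFactors = FALSE
)

# "0/1" | "1/0" fails because `|` is the logical OR operator on values.
# The alternation goes inside one pattern string instead.
pattern <- "0/1|1/0|1/1"

samples <- c("Run.Sample1", "Run.Sample2", "Run.Sample3")

# One logical vector per sample column, combined element-wise with OR.
hits <- lapply(dat[samples], function(x) grepl(pattern, x))
keep <- Reduce(`|`, hits)

res1 <- dat[keep, ]
rownames(res1)
# [1] "489" "490" "491"
```

Selecting the columns by name also guards against the positions shifting if the data frame is subset differently later.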
For the second case based on the conditions mentioned:
dat[ Reduce(`|`,lapply(dat[9:10], function(x) grepl("0/1|1/0|1/1", x)))
& grepl("0/0", dat[,11]),]
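To also keep rows where Run.Sample3 is missing ("./."), which matches the expected output shown for code 2:
dat[ Reduce(`|`,lapply(dat[9:10], function(x) grepl("0/1|1/0|1/1", x)))
& grepl("\\.\\/\\.|0/0", dat[,11]),]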
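Because grepl searches the whole field rather than just the genotype, one possible variant is to extract the genotype (the text before the first colon, since GT is the first FORMAT field) and compare it exactly. This is only a sketch under assumptions, not a replacement for the code above: the toy `dat` holds just the three sample columns, the genotypes are assumed to be unphased (written with "/"), and `gt`, `variant`, `in12`, `case2_strict` and `case2_with_missing` are hypothetical names.

```r
# Toy data modelled on the four example rows; only the sample columns are kept.
dat <- data.frame(
  Run.Sample1 = c("0/1:11,0:11:33:0,33,381", "0/1:11,6:17:99:171,0,274",
                  "0/1:16,4:13:45:0,45,532", "0/0:23,0:23:54:0,54,700"),
  Run.Sample2 = c("./.", "0/1:6,5:11:99:159,0,116",
                  "0/0:11,0:11:30:0,30,366", "0/0:15,1:16:36:0,36,463"),
  Run.Sample3 = c("./.", "1/1:0,15:15:36:441,36,0",
                  "0/0:16,0:17:39:0,39,479", "0/0:19,0:19:48:0,48,598"),
  row.names = c("489", "490", "491", "492"),
  stringsAsFactors = FALSE
)

# Genotype = everything before the first ":"; "./." has no colon and is
# returned unchanged by sub().
gt <- lapply(dat, function(x) sub(":.*$", "", x))

variant <- c("0/1", "1/0", "1/1")

# TRUE where Sample1 or Sample2 carries a variant genotype.
in12 <- gt$Run.Sample1 %in% variant | gt$Run.Sample2 %in% variant

# Code 2 exactly as stated: Sample3 must be "0/0".
case2_strict <- dat[in12 & gt$Run.Sample3 == "0/0", ]

# Code 2 as in the expected output: Sample3 may also be missing ("./.").
case2_with_missing <- dat[in12 & gt$Run.Sample3 %in% c("0/0", "./."), ]

rownames(case2_strict)
# [1] "491"
rownames(case2_with_missing)
# [1] "489" "491"
```

If the file contains phased genotypes such as "0|1", the `variant` vector would need those added as well.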