I have two very large lists of genes, A and B. A has two columns, GeneID and p-value, while B has only one column, GeneID. There are approximately 100,000 genes in B, and these are a subset of the genes in A (about 700,000 genes):
GeneListA
GeneID     p.value
41931      0.0210
41931      0.0003
5310612    0.3161
5310612    0.7089
5310612    0.0021
98317      0.1139
98317      0.0009
215688     0.0031
215688     0.0008

GeneListB
GeneID
41931
41931
215688
215688

Desired GeneListC
5310612    0.3161
5310612    0.7089
5310612    0.0021
98317      0.1139
98317      0.0009
I do not want the genes in B to show up in A anymore. How do I get rid of them while still keeping my p-values in A? I tried three different methods so far:
I got rid of my p-value column so there are only Entrez Gene IDs in both lists. Then I used the following code:
new <- A[setdiff(rownames(A), rownames(B)), ]
but I got a completely different set of genes than expected. It was a seemingly random mixture of genes from A and B, rather than A-B.
I also tried: new<-A[!apply(A,1,FUN=function(y){any(apply(B,1,FUN=function(x){all(x==y)}))}),]
I'm getting destroyed by this, so any help would be appreciated.
You can subset the data frame with the %in% operator.
GeneListA[!GeneListA$GeneID %in% GeneListB$GeneID, ]
Combined with !, the statement reads: give me all rows of GeneListA where GeneID is not among the GeneIDs in GeneListB.
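As a quick check, here is a minimal sketch that rebuilds the sample data from the question and applies the same filter. The intermediate variable `keep` is only an illustrative name, and storing GeneID as numeric is an assumption; the same approach should work if the IDs are character strings.

```r
# Sample data copied from the question
GeneListA <- data.frame(
  GeneID  = c(41931, 41931, 5310612, 5310612, 5310612,
              98317, 98317, 215688, 215688),
  p.value = c(0.0210, 0.0003, 0.3161, 0.7089, 0.0021,
              0.1139, 0.0009, 0.0031, 0.0008)
)
GeneListB <- data.frame(GeneID = c(41931, 41931, 215688, 215688))

# %in% gives one TRUE/FALSE per row of GeneListA:
# TRUE where that row's GeneID also occurs in GeneListB.
# ! flips it, so keep is TRUE for rows we want to retain.
keep <- !(GeneListA$GeneID %in% GeneListB$GeneID)

# Row subsetting keeps every column, so the p-values stay attached
GeneListC <- GeneListA[keep, ]
print(GeneListC)
```

On the sample data this should leave the five rows for 5310612 and 98317, with their p-values intact. The first attempt in the question most likely failed because rownames() on a data frame returns the row labels (by default "1", "2", "3", ...), not the gene IDs, so setdiff() was comparing row positions rather than genes.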