I'm trying to extract a set of genes (row names) from my large data set based on another data matrix that contains a list of my genes of interest. I've read about that I should use the filter and %in% command, but am unsure as to how to write it properly.
example: my large database:
Gene Week1 Week 2. Week 3
A. 20. 14. 5
B. 5. 10. 15
C. 2. 4. 6
D. 20. 18. 19
my small data base:
Gene
A
C
D
And I want my result to be:
Gene Week1 Week 2. Week 3
A. 20. 14. 5
C. 2. 4. 6
D. 20. 18. 19
Could anybody please help out? I'd really appreciate it and my apologies for the rather simple question :)
Using logical row indexes:
large_database[large_database$Gene %in% unique(small_data_base$Gene), ]
Explanation:
large_database$Gene %in% unique(small_data_base$Gene)
Checks for each entry (i.e. row) in large_database$Gene
if it appears in unique(small_database$Gene)
i.e. the list of unique values in the column Gene
of small_data_base
and returns a boolean vector (a vector of TRUE and FALSE).
We then can use this vector as a row 'index' to selecet only rows where the vector is TRUE (i.e. the value of large_database$Gene
was in unique(small_database$Gene)