Search code examples
rtime-seriesfilteringsubsetdatamatrix

Extracting rows from data frame based on another data frame


I'm trying to extract a set of genes (row names) from my large data set based on another data matrix that contains a list of my genes of interest. I've read about that I should use the filter and %in% command, but am unsure as to how to write it properly.

example: my large database:

Gene        Week1         Week 2.        Week 3
A.           20.           14.            5
B.           5.            10.            15
C.           2.            4.             6
D.           20.           18.            19

my small data base:

Gene
A
C
D

And I want my result to be:

Gene        Week1         Week 2.        Week 3
A.           20.           14.            5
C.           2.            4.             6
D.           20.           18.            19

Could anybody please help out? I'd really appreciate it and my apologies for the rather simple question :)


Solution

  • Using logical row indexes:

    large_database[large_database$Gene %in% unique(small_data_base$Gene), ]
    

    Explanation:

    large_database$Gene %in% unique(small_data_base$Gene)
    

    Checks for each entry (i.e. row) in large_database$Gene if it appears in unique(small_database$Gene) i.e. the list of unique values in the column Gene of small_data_base and returns a boolean vector (a vector of TRUE and FALSE).

    We then can use this vector as a row 'index' to selecet only rows where the vector is TRUE (i.e. the value of large_database$Gene was in unique(small_database$Gene)