match all occurrences in data frame


I'm trying to do something similar to this post: Extract rows for the first occurrence of a variable in a data frame, but I want to extract all occurrences, not just the first.

Here is a simplified example. I have this data frame called toDrop:

Gene   Taxa
123    A
327    B
445    D
557    A
789    E
123    B
557    C

Here's my code, which uses match and thus returns only the first match. I'm running this inside a loop, so I've simplified things here.

Gene <- c("123", "327", "445", "557", "789", "123", "557")
Taxa <- c("A", "B", "D", "A", "E", "B", "C")
toDrop <- data.frame(Gene, Taxa)
Temp <- list()
geneNameTemp <- "123"
toDrop[match(geneNameTemp, toDrop$Gene), 2] -> Temp

In this example, Temp should return a list of "A" and "B". I think I need to use lapply as in this post, but I can't figure it out from that example. Thanks for the help.


Solution

  • There are several ways to do this. One way in base R that is close to what you've already got is which() combined with %in%:

    Gene <- c("123", "327", "445", "557", "789", "123", "557")
    Taxa <- c("A", "B", "D", "A", "E", "B", "C")
    toDrop <- data.frame(Gene, Taxa)
    Temp <- list()
    geneNameTemp <- "123"
    Temp <- as.list(toDrop[which(toDrop$Gene %in% geneNameTemp),2])
    Temp
    # [[1]]
    # [1] A
    # Levels: A B C D E
    # 
    # [[2]]
    # [1] B
    # Levels: A B C D E
    

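    This returns a list with the two factors. (The factor output shown here is what you get in R versions before 4.0.0, where data.frame() converts strings to factors by default; from 4.0.0 on, the elements are plain character strings.) The method also works when geneNameTemp is a vector, but the result will include duplicates if there are any: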

    Gene <- c("123", "327", "445", "557", "789", "123", "557")
    Taxa <- c("A", "B", "D", "A", "E", "B", "C")
    toDrop <- data.frame(Gene, Taxa)
    Temp <- list()
    geneNameTemp <- c("123", "327")
    Temp <- as.list(toDrop[which(toDrop$Gene %in% geneNameTemp),2])
    Temp
    # [[1]]
    # [1] A
    # Levels: A B C D E
    # 
    # [[2]]
    # [1] B
    # Levels: A B C D E
    # 
    # [[3]]
    # [1] B
    # Levels: A B C D E
    

    If you only need a vector rather than a list, you can remove as.list(). To remove the duplicates, use unique(toDrop[which(toDrop$Gene %in% geneNameTemp),2]).
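
Since the question mentions running this inside a loop, here is a minimal sketch of how the same idea might look for several genes at once. It contrasts match() (first hit only) with a logical comparison (every hit), then collects the taxa per gene with lapply(). The names genesOfInterest and taxaByGene are hypothetical, and stringsAsFactors = FALSE is set explicitly so the result is character regardless of R version; this is one possible approach, not the only one.

```r
# Same example data as in the question; keep Taxa as character, not factor
toDrop <- data.frame(
  Gene = c("123", "327", "445", "557", "789", "123", "557"),
  Taxa = c("A", "B", "D", "A", "E", "B", "C"),
  stringsAsFactors = FALSE
)

# match() returns only the position of the first hit
firstHit <- match("123", toDrop$Gene)
firstHit
# [1] 1

# A logical comparison flags every matching row; which() is not needed
allTaxa <- toDrop$Taxa[toDrop$Gene == "123"]
allTaxa
# [1] "A" "B"

# For several genes, collect the matching taxa per gene in a named list.
# genesOfInterest is a hypothetical name used only for illustration.
genesOfInterest <- c("123", "557")
taxaByGene <- lapply(genesOfInterest, function(g) toDrop$Taxa[toDrop$Gene == g])
names(taxaByGene) <- genesOfInterest
taxaByGene
# $`123`
# [1] "A" "B"
#
# $`557`
# [1] "A" "C"
```

Each element of taxaByGene can then be looked up by gene name, for example taxaByGene[["557"]], and wrapped in unique() if duplicates should be dropped.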