Search code examples
rdataframemergecompareset-difference

Select rows from one data.frame that are not present in a second data.frame


I have two data.frames:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

I want to find the rows a1 have that a2 doesn't.

Is there a built in function for this type of operation?

(p.s: I did write a solution for it, I am simply curious if someone already made a more crafted code)

Here is my solution:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)

Solution

  • This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

    library(compare)
    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])
    comparison <- compare(a1,a2,allowAll=TRUE)
    comparison$tM
    #  a b
    #1 1 a
    #2 2 b
    #3 3 c
    

    The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

    difference <-
       data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
    colnames(difference) <- colnames(a1)
    difference
    #  a b
    #1 4 d
    #2 5 e