Search code examples
rdataframecomparecomparison

Finding difference between two dataframes by a column is not working


I have two large datasets in newdata & olddata dataframes with ID as primary column key & 30 other columns. While most of the IDs are common between new & old, there are few old IDs that are deleted & not present in new and vice versa, new IDs are added in New that were not present in Old.

now, i am trying to put label to all 3 categories as Common, New & Old and while New & Common are working fine, i am not having so much success with Old. Not Sure what am i missing and any pointers would be appreciated

commonIDs<-data.frame(intersect(old_data$IDs,new_data$IDs))

#New Ids
added <- NULL
added <- new_data[!new_data$IDs %in% commonIDs$IDs,]
added <- data.frame(Remarks ="New", added)


#Deleted IDs
deleted <- NULL
deleted <- old_data[!old_data$IDs %in% commonIDs$IDs,]

I also tried another approach but still not luck

new <- data.frame(new_data[is.na(match(new_data$IDs,commonIDs$IDs)),])
old <- data.frame(old_data[is.na(match(old_data$IDs,commonIDs$IDs)),])

Solution

  • You might also consider trying a package developed around this use case.

    Using @r2evans' data:

    dat <- data.frame(id = 1:10)
    dat1 <- dat[-(3:4),,drop=FALSE]
    dat2 <- dat[-(6:10),,drop=FALSE]
    

    We could run:

    library(waldo)
    compare(dat1, dat2)
    

    and get this output (pasted as picture to retain color formatting), which shows us that dat1 is missing the data and rownames of rows 3-4, and dat2 is missing those for rows 6-10.

    enter image description here