Search code examples
rduplicatesduplicate-datana

Setting different values in duplicated observations to NA


I have a data frame (DF) that looks as follows:

structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE", 
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1", 
"1"), VALUE3 = c("2", "1", "1", "2", "1"), VALUE4 = c("2", "1", 
"2", "1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3", 
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")

Uniqueness in this case is defined by ID and REPORTER. So the DF above contains a duplicate for the ID 123 and REPORTER ONE and the ID 789 and REPORTER THREE. Since I cannot tell which values of VALUE1 to VALUE4 are the correct ones, I like to set all values to NA, that differ within a duplicate.

This means I first have to identify the columns of VALUE that contain different values. These are the ones to be set to NA. For the rest I like to keep the data since here I can tell the value is correct.

The expected output would look like this:

structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE", 
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1", 
"1"), VALUE3 = c(NA, NA, "1", NA, NA), VALUE4 = c(NA, NA, "2", 
"1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3", 
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")

The goal is to ensure data quality. I don't like to just remove the problem cases since I can use the not differing values for analysis. But I also do not like to just use one of the rows because this would lead to wrong conclusions if I had chosen the wrong values.

How can I do this?


Solution

  • I think this is what you are looking for:

    library(reshape2)
    DFL <- melt(cbind(rn = 1:nrow(DF), DF), id.vars=c("rn", "ID", "REPORTER"))
    DFL$value2 <- ave(DFL$value, DFL[c("ID", "REPORTER", "variable")], 
                      FUN = function(x) {
                        ifelse(length(unique(x)) > 1, NA, x)
                      })
    dcast(DFL, rn + ID + REPORTER ~ variable, value.var = "value2")
    #   rn  ID REPORTER VALUE1 VALUE3 VALUE4
    # 1  1 123      ONE      1   <NA>   <NA>
    # 2  2 123      ONE      1   <NA>   <NA>
    # 3  3 456      TWO      2      1      2
    # 4  4 789    THREE      1   <NA>      1
    # 5  5 789    THREE      1   <NA>      1
    

    As you can see, I had to add a dummy "rn" supplementary ID variable to make sure that dcast wouldn't just collapse all the values into one row per ID+REPORTER combination.


    Update

    This is actually also entirely doable with base R's reshape and the ave step described above:

    DFL <- reshape(DF, direction = "long", 
                   varying = grep("VALUE", names(DF)), sep = "")
    DFL <- within(DFL, {
      VALUE <- ave(VALUE, ID, REPORTER, time, FUN = function(x) 
        ifelse(length(unique(x)) > 1, NA, x))
    })
    reshape(DFL)
    #      ID REPORTER id VALUE1 VALUE3 VALUE4
    # 1.1 123      ONE  1      1   <NA>   <NA>
    # 2.1 123      ONE  2      1   <NA>   <NA>
    # 3.1 456      TWO  3      2      1      2
    # 4.1 789    THREE  4      1   <NA>      1
    # 5.1 789    THREE  5      1   <NA>      1
    

    In the last line above, the attributes from the original reshape statement make it so we don't have to even worry about what arguments we need to put in. :-)