I have a data frame (DF) that looks as follows:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c("2", "1", "1", "2", "1"), VALUE4 = c("2", "1",
"2", "1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
Uniqueness in this case is defined by ID and REPORTER. So the DF above contains a duplicate for the ID 123 and REPORTER ONE and the ID 789 and REPORTER THREE. Since I cannot tell which values of VALUE1 to VALUE4 are the correct ones, I like to set all values to NA, that differ within a duplicate.
This means I first have to identify the columns of VALUE that contain different values. These are the ones to be set to NA. For the rest I like to keep the data since here I can tell the value is correct.
The expected output would look like this:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c(NA, NA, "1", NA, NA), VALUE4 = c(NA, NA, "2",
"1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
The goal is to ensure data quality. I don't like to just remove the problem cases since I can use the not differing values for analysis. But I also do not like to just use one of the rows because this would lead to wrong conclusions if I had chosen the wrong values.
How can I do this?
I think this is what you are looking for:
library(reshape2)
DFL <- melt(cbind(rn = 1:nrow(DF), DF), id.vars=c("rn", "ID", "REPORTER"))
DFL$value2 <- ave(DFL$value, DFL[c("ID", "REPORTER", "variable")],
FUN = function(x) {
ifelse(length(unique(x)) > 1, NA, x)
})
dcast(DFL, rn + ID + REPORTER ~ variable, value.var = "value2")
# rn ID REPORTER VALUE1 VALUE3 VALUE4
# 1 1 123 ONE 1 <NA> <NA>
# 2 2 123 ONE 1 <NA> <NA>
# 3 3 456 TWO 2 1 2
# 4 4 789 THREE 1 <NA> 1
# 5 5 789 THREE 1 <NA> 1
As you can see, I had to add a dummy "rn" supplementary ID variable to make sure that dcast
wouldn't just collapse all the values into one row per ID+REPORTER combination.
This is actually also entirely doable with base R's reshape
and the ave
step described above:
DFL <- reshape(DF, direction = "long",
varying = grep("VALUE", names(DF)), sep = "")
DFL <- within(DFL, {
VALUE <- ave(VALUE, ID, REPORTER, time, FUN = function(x)
ifelse(length(unique(x)) > 1, NA, x))
})
reshape(DFL)
# ID REPORTER id VALUE1 VALUE3 VALUE4
# 1.1 123 ONE 1 1 <NA> <NA>
# 2.1 123 ONE 2 1 <NA> <NA>
# 3.1 456 TWO 3 2 1 2
# 4.1 789 THREE 4 1 <NA> 1
# 5.1 789 THREE 5 1 <NA> 1
In the last line above, the attributes
from the original reshape
statement make it so we don't have to even worry about what arguments we need to put in. :-)