Search code examples
rmergeduplicatesuniquena

Set a variable to NA if other variables are duplicates in R


I have the following data frame containing medicine code according to the route of administration:

code <- data.frame(inn = c("ibuprofen", "ibuprofen", "ibuprofen", "fusidic acid", "fusidic acid"),
                   route = c("unknown", "unknown", "unknown", "oral", "topical"),
                   atc = c("R02AX02", "G02CC01", "M01AE01", "J01XC01", "D06AX01"))

           inn   route     atc
1    ibuprofen unknown R02AX02
2    ibuprofen unknown G02CC01
3    ibuprofen unknown M01AE01
4 fusidic acid    oral J01XC01
5 fusidic acid topical D06AX01

And another one containing patient treatment and event:

event <- data.frame(id = c(1, 1, 2),
                    inn = c("ibuprofen", "fusidic acid", "fusidic acid"),
                    route = c("unknown", "oral", "topical"),
                    event = c(TRUE, FALSE, TRUE))

  id          inn   route event
1  1    ibuprofen unknown  TRUE
2  1 fusidic acid    oral FALSE
3  2 fusidic acid topical  TRUE

I need to merge those data frames to get the following result:

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE NA

I do not get this result with a simple merge:

merge(x = event,
      y = code)

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE R02AX02
4    ibuprofen unknown  1  TRUE G02CC01
5    ibuprofen unknown  1  TRUE M01AE01

I thought of two solutions, but I did not manage to implement any:

  • modify the code data frame before the merge to set atc to NA if there is different atc for a group of inn and route (this seems more appropriate)
  • modify the result of merge to set atc to NA if there is different atc for a group of inn, route and id

How can I do it in base R? Is there another better way? I work in a restrictive environment where I only have access to base R.


Solution

  • Here's a straightforward way to accomplish option 2. Starting with the result of a simple merge:

    mrg <- merge(x = event,
                 y = code)
    
               inn   route id event     atc
    1 fusidic acid    oral  1 FALSE J01XC01
    2 fusidic acid topical  2  TRUE D06AX01
    3    ibuprofen unknown  1  TRUE R02AX02
    4    ibuprofen unknown  1  TRUE G02CC01
    5    ibuprofen unknown  1  TRUE M01AE01
    

    We then check which rows are duplicated (dropping the atc variable). We need to use duplicated twice because it actually finds duplicate rows, not rows that have duplicates. So, it would catch rows 4 and 5, but not 3 – to get that, we need to repeat duplicated from the opposite direction. Read more here: Finding ALL duplicate rows, including “elements with smaller subscripts”:

    mrg$atc <- ifelse(duplicated(mrg[,-5]) | duplicated(mrg[,-5], fromLast = T),
                      NA,
                      mrg$atc)
    mrg
    
               inn   route id event     atc
    1 fusidic acid    oral  1 FALSE J01XC01
    2 fusidic acid topical  2  TRUE D06AX01
    3    ibuprofen unknown  1  TRUE    <NA>
    4    ibuprofen unknown  1  TRUE    <NA>
    5    ibuprofen unknown  1  TRUE    <NA>
    

    If you want to get rid of the duplicate rows 4 and 5, just run duplicated one more time to drop them:

    mrg[!duplicated(mrg),]
    
               inn   route id event     atc
    1 fusidic acid    oral  1 FALSE J01XC01
    2 fusidic acid topical  2  TRUE D06AX01
    3    ibuprofen unknown  1  TRUE    <NA>