Search code examples
rdplyr

Replace subset of dataframe (rows and columns)


I have a dataframe o which contains incorrect data. I have another dataframe r with the correct data. I want to replace the subset of o (both rows and columns) with r. I'm seeking a solution that scales to many variables / columns. (I do have a vector rv with the column names of the variables that need replacing)

o <- tibble(id = c(2, 3, 4, 7, 10, 2), aa = letters[1:6], bb = LETTERS[2:7], cc = 11:16)
r <- tibble(id = c(2,4), cc = c(111, 113), aa = c("aa", "cc"))
rv <- c("aa", "cc")

> o
# A tibble: 6 × 4
     id aa    bb       cc
  <dbl> <chr> <chr> <int>
1     2 a     B        11
2     3 b     C        12
3     4 c     D        13
4     7 d     E        14
5    10 e     F        15
6     2 f     G        16

> r
# A tibble: 2 × 3
     id    cc aa   
  <dbl> <dbl> <chr>
1     2   111 aa   
2     4   113 cc   

I want to end up with (I don't care about the order of rows or columns):

> o
# A tibble: 6 × 4
     id aa    bb       cc
  <dbl> <chr> <chr> <int>
1     2 aa    B       111
2     3 b     C        12
3     4 cc    D       113
4     7 d     E        14
5    10 e     F        15
6     2 aa    G       111

You can assume that all ids in r are in o, but you cannot assume that columns or rows are the in the same order, and there are duplicated id in o (not in r), where the relevant columns should also be overwritten.

I tried base R, but the duplicate id in o stumps it:

 o[o$id %in% r$id, rv] <- r

I thought of dplyr using left_join but don't know how to overwrite rather than attach.

My question is different from similar ones because there are several columns to replace, but not the entire dataset -- on which I had found useful answers.


Solution

  • If the question is looking for a base R solution that fixes up the attempt in the question then note that we can only write o[...] <- r with just r on the right hand side if it is sufficient to replace just the first occurrence of each id in o. If that were ok then we could fix it like this:

    # only replaces first instance of each r id in o
    o[match(r$id, o$id), names(r)] <- r
    

    With duplicate id's in o, as in the example data, we will need a loop; however, it is still only one line of code:

    # replace all instances of each id so can be used with example data
    for(id in r$id) o[o$id == id, names(r)] <- r[r$id == id, ]
    

    An equivalent alternative to the explicit loop is to use Reduce like this:

    upd <- function(o, id) {
      o[o$id == id, names(r)] <- r[r$id == id, ]
      o
    }
    Reduce(upd, r$id, init = o)