Search code examples
rdataframedata-cleaning

Using a key to clean data in two corresponding columns


I have a large data frame (6 million rows, 20 columns) where data in one column corresponds to data in another column. I created a key that I now want to use to fix rows that have the wrong value. As a small example:

key = data.frame(animal = c('dog', 'cat', 'bird'), 
                  sound = c('bark', 'meow', 'chirp'))

The data frame looks like this (minus the other columns of data):

 df = data.frame(id = c(1, 2, 3, 4),
                     animal = c('dog', 'cat', 'bird', 'cat'), 
                     sound = c('meow', 'bark', 'chirp', 'chirp'))

I swear I have done this before but can't remember my solution. Any ideas?


Solution

  • Using dplyr. If you want to fix sound according to animal,

    library(dplyr)
    df <- df %>% 
      mutate(sound = sapply(animal, function(x){key %>% filter(animal==x) %>% pull(sound)}))
    

    should do the trick. If you want to fix animal according to sound:

    df <- df %>% 
      mutate(animal = sapply(sound, function(x){key %>% filter(sound==x) %>% pull(animal)}))