Search code examples
rdataframedata-cleaningrecode

Copying information over when there are multiple datapoints


I have a data cleaning question. The data collection happened three times and sometimes the data entry was incorrect. Therefore, if the students had their data collected more than one time, the second data point needs to be copied over.

Here is my dataset looks like:

df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,  4,4, 5),
                 text = c("female","male","male", "female","female","female", "male","female","male", "female", "female"),
                 time = c("first","second","third", "first","second","third", "first","second","second", "third", "first"))
            
> df
   id   text   time
1   1 female  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3   male  first
8   3 female second
9   4   male second
10  4 female  third
11  5 female  first

So id 1,3, and 4 have incorrect gender information. I need to copy the second data point when there are multiple/different input about the gender variable. If there is only one datapoint, that should stay in the dataset.

The desired output would be

> df1
   id   text   time
1   1   male  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3 female  first
8   3 female second
9   4   male second
10  4   male  third
11  5 female  first

Any ideas? Thanks!


Solution

  • We could use match

    library(dplyr)
    df %>% 
      group_by(id) %>%
      mutate(text = text[match("second", time, nomatch = 1)]) %>%
      ungroup
    

    -output

    # A tibble: 11 × 3
          id text   time  
       <dbl> <chr>  <chr> 
     1     1 male   first 
     2     1 male   second
     3     1 male   third 
     4     2 female first 
     5     2 female second
     6     2 female third 
     7     3 female first 
     8     3 female second
     9     4 male   second
    10     4 male   third 
    11     5 female first 
    

    Or using coalesce

    df %>% 
      group_by(id) %>%
      mutate(text = coalesce(text[match("second", time)], text)) %>%
      ungroup
    

    -output

    # A tibble: 11 × 3
          id text   time  
       <dbl> <chr>  <chr> 
     1     1 male   first 
     2     1 male   second
     3     1 male   third 
     4     2 female first 
     5     2 female second
     6     2 female third 
     7     3 female first 
     8     3 female second
     9     4 male   second
    10     4 male   third 
    11     5 female first