r, data-cleaning

Picking the last value when there are multiple incorrect datapoints in R


I have a data cleaning question. Data were collected at up to three time points, and some entries were recorded incorrectly. For students with more than one record, the value from the last time point should be copied over their earlier records.

Here is what my dataset looks like:

df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,3, 4),
                 text = c("female","male","male", "female","female","female", "male","female","female", "male"),
                 time = c("first","second","third", "first","second","third", "first","second","third", "first"))
    
> df
   id   text   time
1   1 female  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3   male  first
8   3 female second
9   3 female  third
10  4   male  first

The first and third students have inconsistent gender values because of data entry errors. I need the value from each student's last time point (the third, where available) copied over the earlier rows.

The desired output would be:

> df1
   id   text   time
1   1   male  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3 female  first
8   3 female second
9   3 female  third
10  4   male  first

Any ideas? Thanks!


Solution

  • We could use last() to return the last value of 'text' within each id; mutate() then recycles that single value across the group's rows to overwrite the column.

    library(dplyr)

    df <- df %>%
      group_by(id) %>%
      mutate(text = last(text)) %>%  # last value per id, recycled to every row
      ungroup()
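
    Note that last() returns whatever sits in the final row of each group, so it relies on the rows already being sorted by time. A minimal sketch, assuming the intended order is "first" < "second" < "third" (the time_levels vector is an illustrative name, not part of the original answer), selects the value at the latest time point explicitly instead:

    ```r
    library(dplyr)

    df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,3, 4),
                     text = c("female","male","male", "female","female","female",
                              "male","female","female", "male"),
                     time = c("first","second","third", "first","second","third",
                              "first","second","third", "first"))

    # Assumed chronological order of the 'time' labels
    time_levels <- c("first", "second", "third")

    df1 <- df %>%
      group_by(id) %>%
      # match() converts each label to its rank; which.max() locates the latest row
      mutate(text = text[which.max(match(time, time_levels))]) %>%
      ungroup()
    ```

    With this variant the result should not change if the rows of df are shuffled beforehand.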
    

    If we want the second or third value instead, use nth() and set its position argument to the minimum of that position (2 here) and the group size n(), so that groups with fewer rows fall back to their last element:

    df %>% 
      group_by(id) %>%
      mutate(text = nth(text, min(c(2, n()))))
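
    The clamping done by min(c(2, n())) can be seen in isolation with a small helper. This is only a sketch: nth_or_last and df2 are hypothetical names introduced for illustration, and the helper uses plain vector indexing rather than dplyr's nth().

    ```r
    library(dplyr)

    # Hypothetical helper: return the k-th element of x, or the last one
    # when x has fewer than k elements
    nth_or_last <- function(x, k) {
      x[min(k, length(x))]
    }

    nth_or_last(c("male", "female", "female"), 2)  # "female"
    nth_or_last("male", 2)                         # "male" (only one element)

    df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,3, 4),
                     text = c("female","male","male", "female","female","female",
                              "male","female","female", "male"),
                     time = c("first","second","third", "first","second","third",
                              "first","second","third", "first"))

    # Same idea applied per id; the single value is recycled across the group
    df2 <- df %>%
      group_by(id) %>%
      mutate(text = nth_or_last(text, 2)) %>%
      ungroup()
    ```

    For this particular dataset the second value of each id happens to equal the last one, so df2 matches the desired output; in general the two approaches can differ.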