Search code examples
rdplyrlong-format-data

How to compare two or more lines in a long dataset to create a new variable?


I have a long format dataset like that:

ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_

I have a Classification of each subject in year 2022 based on their addresses in 2022. This Classification was not made in other years. But I would like to generalize this classification to other years, in a way that if their addresses in other years are the same address they hold in 2022, so the NA values from the 'Classification' variable in these years would be replaced with the same value of the 'Classification' they got in 2022.

I have been trying to convert to a wide data and compare the lines in a more direct way with dplyr. But it is not working properly, since there are these NA values. And, also, this doesn't look a smart way to achieve the final dataset I desire. I would like to get the 'Aim' column in my dataset as showed below:

ID year Address Classification Aim
1 2020 A NA NA
1 2021 A NA NA
1 2022 B B_ B_
2 2020 C NA NA
2 2021 D NA NA
2 2022 F F_ F_
3 2020 G NA G_
3 2021 G NA G_
3 2022 G G_ G_
4 2020 H NA H_
4 2021 I NA NA
4 2022 H H_ H_

Solution

  • I use tidyr::fill with dplyr::group_by for this. Here you need to specify the direction (the default is "down" which will fill with NAs since that's the first value in each group).

    library(dplyr)
    library(tidyr)
    
    df %>% 
      group_by(ID, Address) %>%
      tidyr::fill(Classification, .direction = "up")
    

    Output:

    #     ID  year Address Classification
    #   <int> <int> <chr>   <chr>         
    # 1     1  2020 A       NA            
    # 2     1  2021 A       NA            
    # 3     1  2022 B       B_            
    # 4     2  2020 C       NA            
    # 5     2  2021 D       NA            
    # 6     2  2022 F       F_            
    # 7     3  2020 G       G_            
    # 8     3  2021 G       G_            
    # 9     3  2022 G       G_            
    #10     4  2020 H       H_            
    #11     4  2021 I       NA            
    #12     4  2022 H       H_
    

    Data

    df <- read.table(text = "ID year    Address Classification
    1   2020    A   NA
    1   2021    A   NA
    1   2022    B   B_
    2   2020    C   NA
    2   2021    D   NA
    2   2022    F   F_
    3   2020    G   NA
    3   2021    G   NA
    3   2022    G   G_
    4   2020    H   NA
    4   2021    I   NA
    4   2022    H   H_", header = TRUE)