Search code examples
rmissing-data

How to add Zeros where observations are missing


Here's an example of some data that I have

 dput(df)
structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h", 
"i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "b", 
"c", "d", "e", "f", "j", "k", "n", "m", "q", "r"), Number = c(1, 
2, 1, 3, 4, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 
1, 2, 1, 1, 1, 1, 1, 2, 2), Location = c(1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2), Surveyor = c("JKK", "JKK", "JKK", "JKK", "JKK", "JKK", 
"JKK", "JKK", "JKK", "JKK", "JKK", "JKK", "JKK", "JKK", "JKK", 
"JKK", "JKK", "JKK", "JKK", "JKK", "JKK", "JKK", "JKK", "JKK", 
"JKK", "JKK", "JKK", "JKK", "JKK", "JKK", "JKK")), row.names = c(NA, 
-31L), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_character", 
"collector")), Number = structure(list(), class = c("collector_double", 
"collector")), Location = structure(list(), class = c("collector_double", 
"collector")), Surveyor = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x00000253510611f0>, class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

In the above example I have the col ID which contains the letters a-t (my real data have other letter codes), in Location 1 all of these codes are present. However in Location 2 some of them are missing.

I would like to add 0 into the Number col where there are missing IDs.

My real data has 25 locations so ideally I'd like to be able to grab the full list of IDs and check that against each Location and add 0 where they are missing.

I've tried tidyverse::mutate with case_when but I'm not getting anywhere. Any help appreciated.


Solution

  • You can use tidyr::complete() for this. The fill parameter allows you to set the values in the columns you create.

    df  |> 
        tidyr::complete(
            ID,
            Location, 
            fill = list(Number = 0)
        )
    

    For the new values, the output will have 0 for Number and NA for any columns not explicitly set, i.e. Surveyor.

    Output

    I assigned the above to a variable called new_df so you can easily compare with the original df:

    dplyr::anti_join(new_df, df, by = c("ID", "Location"))
    
    # A tibble: 9 × 4
      ID    Location Number Surveyor
      <chr>    <dbl>  <dbl> <chr>   
    1 a            2      0 NA      
    2 g            2      0 NA      
    3 h            2      0 NA      
    4 i            2      0 NA      
    5 l            2      0 NA      
    6 o            2      0 NA      
    7 p            2      0 NA      
    8 s            2      0 NA      
    9 t            2      0 NA