Search code examples
rdplyrtidyr

Return list of which column ID and year has missing (NA) data only for all the other columns in R


ID Year Maths_sub Eng_sub Chem_sub Comp_sub
1 2010 20 15 12 NA
2 2015 17 20 14 16
3 2012 NA 17 12 NA
4 2015 14 13 NA NA
5 2016 NA NA NA NA
6 2015 NA NA NA NA

Hello, I want to know for the above table which ID and for which Year only people with all subject columns are NA. Results should be like below:

ID Year
5 2016
6 2015

I have tried something along the lines of the code below but it is not giving me the ideal answer.

df %>%
  mutate(across(ends_with("sub"), is.na)) %>%  
  pivot_longer(-ID, names_to = "sub") %>%  
  filter(value) %>% 
  group_by(ID) %>%
  summarise(`Missing Variables` = toString(sub))

Also as an extra I would like to know if there is a way to mutate across only certain columns, for example only Maths and Comp and seeing which ID and Year has both these subjects NA?

Thank you very much!


Solution

  • base R approach

    Instinctively I would use rowSums() applied to is.na() of the columns in question.

    sub_cols <- grep("sub$", names(dat), value = TRUE)
    
    dat[
        rowSums(is.na(dat[sub_cols])) == length(sub_cols),
        c("ID", "Year")
    ]
    #   ID Year
    # 5  5 2016
    # 6  6 2015
    

    dplyr approach

    I actually find it quite awkward to do matrix operations using dplyr, but this is how I would do it:

    library(dplyr)
    dat %>%
        filter(
            rowSums(!is.na(select(., ends_with("sub")))) == 0
        ) %>%
        select(ID, Year)
    #   ID Year
    # 5  5 2016
    # 6  6 2015
    

    data.table approach

    For completeness, here's a data.table approach. You can use patterns("sub$") to select columns ending in in "sub" and you can use rowSums() with .SD, so overall I think it's the nicest approach:

    library(data.table)
    setDT(dat)
    
    dat[
        dat[, rowSums(!is.na(.SD)) == 0, .SDcols = patterns("sub$")],
        .(ID, Year)
    ]
    #       ID  Year
    #    <int> <int>
    # 1:     5  2016
    # 2:     6  2015
    

    Data

    dat <- structure(list(ID = 1:6, Year = c(2010L, 2015L, 2012L, 2015L, 
    2016L, 2015L), Maths_sub = c(20L, 17L, NA, 14L, NA, NA), Eng_sub = c(15L, 
    20L, 17L, 13L, NA, NA), Chem_sub = c(12L, 14L, 12L, NA, NA, NA
    ), Comp_sub = c(NA, 16L, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -6L))