ID | Year | Maths_sub | Eng_sub | Chem_sub | Comp_sub |
---|---|---|---|---|---|
1 | 2010 | 20 | 15 | 12 | NA |
2 | 2015 | 17 | 20 | 14 | 16 |
3 | 2012 | NA | 17 | 12 | NA |
4 | 2015 | 14 | 13 | NA | NA |
5 | 2016 | NA | NA | NA | NA |
6 | 2015 | NA | NA | NA | NA |
Hello, I want to know for the above table which ID and for which Year only people with all subject columns are NA. Results should be like below:
ID | Year |
---|---|
5 | 2016 |
6 | 2015 |
I have tried something along the lines of the code below but it is not giving me the ideal answer.
df %>%
mutate(across(ends_with("sub"), is.na)) %>%
pivot_longer(-ID, names_to = "sub") %>%
filter(value) %>%
group_by(ID) %>%
summarise(`Missing Variables` = toString(sub))
Also as an extra I would like to know if there is a way to mutate across only certain columns, for example only Maths and Comp and seeing which ID and Year has both these subjects NA?
Thank you very much!
Instinctively I would use rowSums()
applied to is.na()
of the columns in question.
sub_cols <- grep("sub$", names(dat), value = TRUE)
dat[
rowSums(is.na(dat[sub_cols])) == length(sub_cols),
c("ID", "Year")
]
# ID Year
# 5 5 2016
# 6 6 2015
dplyr
approachI actually find it quite awkward to do matrix operations using dplyr
, but this is how I would do it:
library(dplyr)
dat %>%
filter(
rowSums(!is.na(select(., ends_with("sub")))) == 0
) %>%
select(ID, Year)
# ID Year
# 5 5 2016
# 6 6 2015
data.table
approachFor completeness, here's a data.table
approach. You can use patterns("sub$")
to select columns ending in in "sub"
and you can use rowSums()
with .SD
, so overall I think it's the nicest approach:
library(data.table)
setDT(dat)
dat[
dat[, rowSums(!is.na(.SD)) == 0, .SDcols = patterns("sub$")],
.(ID, Year)
]
# ID Year
# <int> <int>
# 1: 5 2016
# 2: 6 2015
dat <- structure(list(ID = 1:6, Year = c(2010L, 2015L, 2012L, 2015L,
2016L, 2015L), Maths_sub = c(20L, 17L, NA, 14L, NA, NA), Eng_sub = c(15L,
20L, 17L, 13L, NA, NA), Chem_sub = c(12L, 14L, 12L, NA, NA, NA
), Comp_sub = c(NA, 16L, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -6L))