In the following sample data frame (image), I want to remove all rows for a given "pid_old" if any other column has a missing value for that ID, even in a single year. For example, in the 8th row the value for "wage" is missing, so I have to remove all rows where "pid_old" is 2. I would be grateful if anybody could help me write the code for this kind of data frame cleaning in R.
You can do this with tidyverse:
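library(tidyverse)

# Sample data: col1 and col2 each contain one NA
a <- tibble(
  col1 = c("a", NA, "b", "a", "a", "a"),
  col2 = c(1, 2, 3, 4, 5, NA),
  pid_old = c(1, 2, 2, 3, 4, 4))

# Negated version of %in%
`%notin%` <- Negate(`%in%`)

# Keep only rows whose pid_old never appears in a row containing an NA
a %>% filter(
  pid_old %notin% (a %>%
    filter_all(any_vars(is.na(.))) %>%
    pull(pid_old)))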
Please post a reproducible example next time. You can do this by posting the output of dput(yourdata).
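As a rough illustration of the dput() suggestion, the sketch below uses a small made-up data frame. The name yourdata and its values are hypothetical stand-ins, with column names borrowed from the question.

```r
# Hypothetical stand-in for your own data (values are made up)
yourdata <- data.frame(
  pid_old = c(1, 2, 2),
  year    = c(2001, 2001, 2002),
  wage    = c(10.5, NA, 12)
)

# Prints R code that recreates the data; paste that text into the question.
# head() keeps the output short for large data frames.
dput(head(yourdata, 20))

# Anyone can rebuild the data by evaluating the pasted text
dput_text <- capture.output(dput(yourdata))
rebuilt <- eval(parse(text = dput_text))
```

Readers can then assign the pasted structure(...) call to a variable and run the answer's code against the same data.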
Explanation:
Extract a vector of the pid_old values from rows that contain any NA values.
a %>% filter_all(any_vars(is.na(.))) %>% pull(pid_old)
Filter out the rows whose pid_old value is in the above vector (c() stands in for that vector here).
a %>% filter( pid_old %notin% c())
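The two steps can also be written with a named intermediate vector, which makes it easier to inspect which IDs get dropped. This is only a sketch on the sample data: the names bad_ids and cleaned are made up for illustration, and it assumes a dplyr version that provides if_any() (1.0.4 or later) in place of filter_all(any_vars(...)).

```r
library(tidyverse)

a <- tibble(
  col1 = c("a", NA, "b", "a", "a", "a"),
  col2 = c(1, 2, 3, 4, 5, NA),
  pid_old = c(1, 2, 2, 3, 4, 4))

# Step 1: IDs that have an NA in any column of at least one row
bad_ids <- a %>%
  filter(if_any(everything(), is.na)) %>%
  pull(pid_old) %>%
  unique()

# Step 2: drop every row that belongs to one of those IDs
cleaned <- a %>% filter(!pid_old %in% bad_ids)
cleaned
```

With the sample data, bad_ids holds 2 and 4, so only the rows for pid_old 1 and 3 remain.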
This line:
`%notin%` <- Negate(`%in%`)
is credited to https://www.r-bloggers.com/the-notin-operator/
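If you would rather not define a custom operator, negating %in% directly should give the same result. A small check, using made-up vectors x and bad:

```r
`%notin%` <- Negate(`%in%`)

x   <- c(1, 2, 2, 3, 4, 4)   # example IDs
bad <- c(2, 4)               # IDs to exclude

res_notin <- x %notin% bad   # custom operator
res_plain <- !(x %in% bad)   # base R equivalent

identical(res_notin, res_plain)
```

Both forms return a logical vector that is TRUE for the elements of x not found in bad, so either can be used inside filter().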