r dataframe data-cleaning

Removing all rows for an ID that has any missing value in an R data frame


In my sample data frame (posted as an image), I want to remove every row for a given "pid_old" if any other column has a missing value for that ID, even in a single year. For example, in the 8th row the value of "wage" is missing, so I need to remove all rows where "pid_old" is 2. I would be grateful for help writing R code that cleans the data frame this way.



Solution

  • You can do this with tidyverse:

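    library(tidyverse)

    # Toy data: pid_old 2 and 4 each have one row with a missing value
    a <- tibble(
      col1 = c("a", NA, "b", "a", "a", "a"),
      col2 = c(1, 2, 3, 4, 5, NA),
      pid_old = c(1, 2, 2, 3, 4, 4))

    # "Not in" operator: the negation of %in%
    `%notin%` <- Negate(`%in%`)

    # Keep only rows whose pid_old never appears in a row containing an NA
    a %>% filter(
      pid_old %notin% (a %>%
                         filter_all(any_vars(is.na(.))) %>%
                         pull(pid_old)))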

    Please post a reproducible example next time. You can do this by posting the output of dput(yourdata).

    Explanation:

    Extract the pid_old values of every row that contains an NA in any column.

    a %>% filter_all(any_vars(is.na(.))) %>% pull(pid_old)

    Then keep only the rows whose pid_old is not in that vector (ids_with_na is a placeholder name for the vector from the previous step).

    a %>% filter(pid_old %notin% ids_with_na)

    This line:

    `%notin%` <- Negate(`%in%`)
    

    is credited to https://www.r-bloggers.com/the-notin-operator/
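
If you would rather not load the tidyverse, the same two steps can be sketched in base R. This is only a minimal illustration: the columns pid_old, year and wage are assumed from the question's description (the original data was posted as an image), and the names df, bad_ids and cleaned are made up for this example.

```r
# Toy data modelled on the question; column names and values are assumptions
df <- data.frame(
  pid_old = c(1, 1, 2, 2, 3, 3),
  year    = c(2001, 2002, 2001, 2002, 2001, 2002),
  wage    = c(10, 12, 9, NA, 15, 16)
)

# Step 1: IDs that have at least one row with a missing value in any column
bad_ids <- unique(df$pid_old[!complete.cases(df)])

# Step 2: keep only the rows whose ID was not flagged in step 1
cleaned <- df[!(df$pid_old %in% bad_ids), ]

print(cleaned)
```

Here complete.cases() plays the role of filter_all(any_vars(is.na(.))): it marks rows without any NA, so negating it selects the incomplete rows whose IDs should be dropped.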