Search code examples
rdataframegroupingsubsetmultiple-columns

R subset dataframe where no observations of certain variables


I have a dataframe that looks like

country sector data1 data2
France 1 7 .
France 2 10 .
belgium 1 12 7
belgium 2 14 8

I want to subset columns that are missing for a country in all sectors. In this example I would like to drop/exclude column two because it is missing for sector 1 and 2 for france. To be clear I would also be throwing out the values of data2 for belgium in this example.

My expected output would look like

country sector data1
France 1 7
France 2 10
belgium 1 12
belgium 2 14

data 2 is now excluded because it had a complete set of missing values for all sectors in France


Solution

  • We may group by country, create logical columns where the count of NA elements are equal to group size, ungroup, replace the corresponding columns to NA based on the logical column and remove those columns in select

    library(dplyr)
    library(stringr)
    df1 %>% 
      group_by(country) %>% 
      mutate(across(everything(), ~ sum(is.na(.x)) == n(),
           .names = "{.col}_lgl")) %>% 
      ungroup %>% 
      mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(), 
         "_lgl")) )) NA else .x)) %>% 
       select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
    

    -output

    # A tibble: 4 × 3
      country sector data1
      <chr>    <int> <int>
    1 France       1     7
    2 France       2    10
    3 belgium      1    12
    4 belgium      2    14
    

    If we don't use group_by, the steps can be simplified as showed in Maël's post i.e. do the grouping with a base R function within select i.e. either tapply or ave can work

    df1 %>%
       select(where(~ !any(tapply(is.na(.x), df1[["country"]], 
         FUN = all))))
    

    data

    df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
    ), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, NA, 14L), data2 = c(NA, 
    NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")