Search code examples
rdataframedplyrtidyverse

Exclude the columns whose column names contain a certain character, and drop the rows that contain NAs in the remaining columns


Suppose I have a dataframe df as below:

          date      CPI:YoY  PPI:MoM
1   2018-01-15           NA       NA
2   2018-01-31          1.2      0.2
3   2018-01-15           NA       NA
4   2018-02-28          0.9     -0.1
5   2018-03-15           NA       NA
6   2018-03-31          1.0      0.3

I need to implement the following two steps with dplyr: select the columns that do not contain date, and then drop the rows where all values in those columns are NAs.

To select columns not containing date can be implemented with the code: dplyr::select(grep("date", names(df))) or dplyr::select(contains("date", ignore.case = TRUE)); to drop NA rows in the selected columns we can use: filter_at(vars(cols), all_vars(!is.na(.))) or dplyr::drop_na(cols).

Now my question is how to combine the two codes into one? Thanks.

Note: For this example, we can simply use drop_na(CPI:YoY, PPI:MoM) to get the expected result, but since in real data, we usually have many columns, and the position of date column doesn't have to be the first column all the time, so I'd prefer to subset the columns by excluding the date column.

The expected result:

          date      CPI:YoY  PPI:MoM
1   2018-01-31          1.2      0.2
2   2018-02-28          0.9     -0.1
3   2018-03-31          1.0      0.3

References:

Subset data to contain only columns whose names match a condition

filtering data frame based on NA on multiple columns


Solution

  • You could use filter + if_any/if_all:

    df %>%
      filter( !if_all(-contains("date"), is.na) )
    
    df %>%
      filter( if_any(-contains("date"), ~ !is.na(.)) )
    
    df %>%
      filter( if_any(-contains("date"), Negate(is.na)) )
    

    Note: Negate(is.na) = function(x) !is.na(x) = ~ !is.na(.)

    Output
            date CPI:YoY PPI:MoM
    2 2018-01-31     1.2     0.2
    4 2018-02-28     0.9    -0.1
    6 2018-03-31     1.0     0.3
    
    Data
    df <- read.table(text =
    "         date      CPI:YoY  PPI:MoM
    1   2018-01-15           NA       NA
    2   2018-01-31          1.2      0.2
    3   2018-01-15           NA       NA
    4   2018-02-28          0.9     -0.1
    5   2018-03-15           NA       NA
    6   2018-03-31          1.0      0.3", check.names = FALSE)