Search code examples
rdataframerowsnais-empty

R delete Dataframe columns with specific rows conditions


I have a dataframe with empty and NA values, like that:

> DF_datos
 
   V1  V2  V3  V4  V5 V6 
3 aaa aaa             NA
4  bb  bb  70  80     NA
5  cc  cc  80  80     NA
6 ddd ddd             NA  

I would like to delete all columns that contains an empty or NA value on its second or third row, without a loop over the dataframe. With subset function or something similar... The result I want would be:

> DF_datos
 
   V1  V2  V3  V4 
3 aaa aaa           
4  bb  bb  70  80   
5  cc  cc  80  80    
6 ddd ddd     

     
                                         

Solution

  • We can use tidyverse with select from dplyr. In select, specify the logical expression in where to check if there are any non-NA (!is.na(.)) and any non-blank (nzchar) elements in the column to be selected

    library(dplyr)
    DF_datos %>%
          select(where(~ any(!is.na(.))&any(nzchar(.))))
    

    -output

    #   V1  V2 V3 V4
    #3 aaa aaa      
    #4  bb  bb 70 80
    #5  cc  cc 80 80
    #6 ddd ddd      
    

    If we need to only check the 2nd or 3rd row

    DF_datos %>%
      slice(2:3) %>% 
      select(where(~ any(!is.na(.))&any(nzchar(.)))) %>% 
      names %>% 
      select(DF_datos, .)
    

    Or with Filter from base R (R 4.1.0) using the same logic

    Filter(\(x) any(!is.na(x)) & any(nzchar(x)), DF_datos)
    

    -output

    #   V1  V2 V3 V4
    #3 aaa aaa      
    #4  bb  bb 70 80
    #5  cc  cc 80 80
    #6 ddd ddd      
    

    Or for rows 2 and 3

    Filter(\(x) any(!is.na(x)) & any(nzchar(x)), DF_datos[2:3,]) |>
          names() |>
          {\(x) subset(DF_datos, select = x)}()
    

    Or use sum instead of any by checking if the sum of the compound logical expression is greater than 0

    Filter(\(x) sum(!is.na(x) & nzchar(x)) > 0, DF_datos)
    

    In earlier R versions use

    Filter(function(x) any(!is.na(x)) & any(nzchar(x)), DF_datos)
    

    NOTE: All of the above options are efficient as this loops over the columns and is memory efficient as this won't apply the expression on the whole dataset

    Update

    Based on the comments, the OP wanted to delete columns if there are any NA or blank in rows 2 or 3.

    DF_datos$V6 <- c(NA, NA, 80, NA)
    
    DF_datos %>%
      slice(2:3) %>% 
      select(where(~ all(!is.na(.)) & all(nzchar(.)))) %>% names %>% 
      select(DF_datos, .)
    

    -output

      V1  V2 V3 V4
    3 aaa aaa      
    4  bb  bb 70 80
    5  cc  cc 80 80
    6 ddd ddd      
    

    Or using Filter

    Filter(\(x) all(!is.na(x)) & all(nzchar(x)), DF_datos[2:3,]) |>
           names() |>
           {\(x) subset(DF_datos, select = x)}()
    #   V1  V2 V3 V4
    #3 aaa aaa      
    #4  bb  bb 70 80
    #5  cc  cc 80 80
    #6 ddd ddd      
    

    data

    DF_datos <- structure(list(V1 = c("aaa", "bb", "cc", "ddd"), V2 = c("aaa", 
    "bb", "cc", "ddd"), V3 = c("", "70", "80", ""), V4 = c("", "80", 
    "80", ""), V5 = c("", "", "", ""), V6 = c(NA, NA, NA, NA)), row.names = c("3", 
    "4", "5", "6"), class = "data.frame")