Search code examples
rsparklyr

How to extract the column names which doesn't have any null values in a dataset in r/sparklyr?


I want to extract only the column names that doesn't have null values in a large dataset in r.

If my table has 4 columns (id, Price, Product, Status) with columns Price and Status having a few null values and columns id and Product with no null values. Then I would want my output as: id, Product


Solution

  • If you need an exact answer you'll have to scan full dataset first, to count missing values:

    library(dplyr)
    
    df <- copy_to(sc, tibble(
      id = 1:4,  Price = c(NA, 3.20, NA, 42),
      Product = c("p1", "p2", "p3", "p4"),
      Status = c(NA, "foo", "bar", NA)))
    
    null_counts <- df %>% 
        summarise_all(funs(sum(as.numeric(is.na(.)), na.rm=TRUE))) %>% 
        collect() 
    
    null_counts
    
    # A tibble: 1 x 4
         id Price Product Status
      <dbl> <dbl>   <dbl>  <dbl>
    1     0     2       0      2
    

    determine which columns have missing count equal to zero:

    cols_without_nulls <- null_counts %>% 
      select_if(funs(. == 0)) %>% 
      colnames()
    
    cols_without_nulls
    
    [1] "id"      "Product"
    

    and use these to select

    df %>% select(one_of(cols_without_nulls))
    
    # Source: spark<?> [?? x 2]
         id Product
      <int> <chr>  
    1     1 p1     
    2     2 p2     
    3     3 p3     
    4     4 p4 
    

    A shorter variant exists:

    df %>% select_if(funs(sum(as.numeric(is.na(.)), na.rm=TRUE) == 0))
    
    Applying predicate on the first 100 rows
    # Source: spark<?> [?? x 2]
         id Product
      <int> <chr>  
    1     1 p1     
    2     2 p2     
    3     3 p3     
    4     4 p4    
    

    but as you see it will only sample the data.