Search code examples
rdataframesparkr

SparkR: How to extract rows that contain null values in a specific column


Disclaimer: I'm little to no experience with SparkR

Take the following dataframe:

ID          Date1       Date2
58844880    04/11/16    NaN
59745846    04/12/16    04/14/16
59743311    04/13/16    NaN
59745848    04/14/16    04/11/16
59598413    NaN         NaN
59745921    04/14/16    04/14/16
59561199    04/15/16    04/15/16
NaN         04/16/16    04/16/16
59561198    NaN         04/17/16

I'd like to just grab the rows that have NaN in the Date 2 column.

In R, I would use new_DF <- DF[is.na(DF$Var),] but I'm not sure what to do in SparkR


Solution

  • Here is one option using sparklyr

    library(sparklyr)
    library(dplyr)
    
    con <- spark_connect(master = "local")
    DF1 = copy_to(con, DF)
    
    
    DF1 %>%
       mutate_at(vars(matches("Date")), 
              funs(to_date(from_unixtime(unix_timestamp(., "MM/dd/yy"))))) %>%
       filter(is.na(Date2)) %>%
       collect()
    # A tibble: 3 x 3
    #        ID Date1      Date2     
    #     <dbl> <date>     <date>    
    #1 58844880 2016-04-10 NA        
    #2 59743311 2016-04-12 NA        
    #3 59598413 NA         NA      
    
    spark_disconnect(con)
    

    data

    DF <- structure(list(ID = c(58844880, 59745846, 59743311, 59745848, 
     59598413, 59745921, 59561199, NaN, 59561198), Date1 = c("04/11/16", 
     "04/12/16", "04/13/16", "04/14/16", "NaN", "04/14/16", "04/15/16", 
     "04/16/16", "NaN"), Date2 = c("NaN", "04/14/16", "NaN", "04/11/16", 
     "NaN", "04/14/16", "04/15/16", "04/16/16", "04/17/16")), .Names = c("ID", 
     "Date1", "Date2"), class = "data.frame", row.names = c(NA, -9L
     ))