Tags: r, apache-spark, sparklyr

Sparklyr dimension issue with spark_read_csv: NA result


When I open a dataset (.csv) in a Spark environment with spark_read_csv and ask for the dimensions of the resulting tibble, the number of rows comes back as NA instead of an actual count. What am I missing when I open the CSV file?

Here is what I obtain:

data = spark_read_csv(
  spark_conn, name = "Advert", path = "/path/to/file", 
  header = TRUE, delimiter = ","
)

dim(data)
[1] NA  5

Solution

  • In general, when you work with data backed by a database or a database-like system, the number of rows cannot be determined without fully or partially evaluating the query and paying the cost of that operation.

    In the case of Spark, this can mean fetching the data from remote storage, then parsing and aggregating it.

    Because of that, nrow (like some other operations designed with in-memory data in mind) always returns NA in dplyr / dbplyr.
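
    You can see this directly with the data object from the question:

    nrow(data)
    [1] NA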

    Instead, you can use dplyr::summarise with n():

    df <- copy_to(sc, iris)  # sc is an active Spark connection
    
    df %>% summarise(n = n())
    
    # Source: spark<?> [?? x 1]
          n
      <dbl>
    1   150
    

    or dplyr::count:

    df %>% count()
    
    # Source: spark<?> [?? x 1]
          n
      <dbl>
    1   150
    

    or sparklyr::sdf_nrow:

    df %>% sparklyr::sdf_nrow()
    
    [1] 150
    

    The last option is probably what you're looking for.
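
    Note that summarise and count return one-row remote tables rather than plain R values. If you need the count as a local number in your R session, here is a minimal sketch, assuming df and data are defined as above:

    # pull() collects the single value from the remote table into R
    row_count <- df %>% count() %>% pull(n)
    
    # or get the row count of the Advert table from the question directly
    n_rows <- sparklyr::sdf_nrow(data)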