Tags: r, apache-spark, sparklyr

Sparklyr dimension issue with spark_read_csv: NA result


When I open a dataset (.csv) in a Spark environment with spark_read_csv and ask for the dimensions of the resulting tibble, the number of rows comes back as NA instead of an actual count. What am I missing when I open the CSV file?

Here is what I obtain:

data = spark_read_csv(
  spark_conn, name = "Advert", path = "/path/to/file", 
  header = TRUE, delimiter = ","
)

dim(data)
[1] NA  5

Solution

  • In general, when you work with data backed by a database or a database-like system, the number of rows cannot be determined without fully or partially evaluating the query and paying the cost of that operation.

    In the case of Spark, this can mean fetching the data from remote storage, then parsing and aggregating it.

    Because of that, nrow (like some other operations designed with in-memory data in mind) always returns NA in dplyr / dbplyr.
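
    You can see this directly with the data object from the question:

    nrow(data)
    [1] NA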

    Instead, you can use dplyr::summarise with n():

    df <- copy_to(sc, iris)  # sc is an active Spark connection
    
    df %>% summarise(n = n())
    
    # Source: spark<?> [?? x 1]
          n
      <dbl>
    1   150
    

    or dplyr::count:

    df %>% count()
    
    # Source: spark<?> [?? x 1]
          n
      <dbl>
    1   150
    

    or sparklyr::sdf_nrow:

    df %>% sparklyr::sdf_nrow()
    
    [1] 150
    

    The last option is probably what you're looking for.
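
    Note that summarise and count return one-row remote tables rather than plain R values. If you need the count as a local number in your R session, here is a minimal sketch, assuming df and data are defined as above:

    # pull() collects the single value from the remote table into R
    row_count <- df %>% count() %>% pull(n)
    
    # or get the row count of the Advert table from the question directly
    n_rows <- sparklyr::sdf_nrow(data)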