When I open a dataset (.csv) in a Spark environment with spark_read_csv and ask for the dimensions of the resulting tibble object, the number of rows comes back as NA. What is missing when I open the csv file?
Here is what I obtain:
data = spark_read_csv(
  spark_conn, name = "Advert", path = "/path/to/file",
  header = TRUE, delimiter = ","
)
dim(data)
[1] NA 5
In general, when you work with data backed by a database or a database-like system, the number of rows cannot be determined without fully or partially evaluating the query and paying the price of that operation. In the case of Spark this can mean fetching data from remote storage, parsing it, and aggregating it. Because of that, nrow (like some other operations designed with in-memory data in mind) always returns NA in dplyr / dbplyr.
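If the data is small enough to fit in local memory, one workaround is to collect it into R first, at which point nrow behaves as usual; this is just a sketch using the data object from your question, and it pays the full evaluation cost up front:

# collect() runs the Spark job and pulls all rows into a local tibble
data %>% dplyr::collect() %>% nrow()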
Instead you can use dplyr::summarise with n():
df <- copy_to(sc, iris)
df %>% summarise(n = n())
# Source: spark<?> [?? x 1]
      n
  <dbl>
1   150
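If you need the count as a plain local number rather than a one-row remote table, pull() should do it (pull is standard dplyr and collects the value through dbplyr):

# pull() collects the single value from Spark into a local numeric
df %>% summarise(n = n()) %>% pull(n)
[1] 150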
Another option is dplyr::count():
df %>% count()
# Source: spark<?> [?? x 1]
      n
  <dbl>
1   150
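count() also accepts grouping columns, so the same pattern gives per-group row counts without leaving Spark (Species here is simply a column of the copied iris table):

# one row per species, counted on the Spark side
df %>% count(Species)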
or sparklyr::sdf_nrow():
df %>% sparklyr::sdf_nrow()
[1] 150
The last option is probably what you're looking for.
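And if you want a drop-in replacement for dim() itself, sparklyr also ships sdf_dim() (along with sdf_ncol()), which computes both dimensions on the Spark side:

sparklyr::sdf_dim(df)
[1] 150   5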