I am still new to Azure Databricks, and SparkR remains very obscure to me, even for very simple tasks...
It took me a very long time to find out how to count distinct values, and I'm not sure this is the right way to do it:
library(SparkR)
sparkR.session()
DW <- sql("select * from db.mytable")
nb.per <- head(summarize(DW, n_distinct(DW$VAR)))
I thought I had found it, but nb.per is not a single value; it is still a data frame:
class(nb.per)
[1] "data.frame"
I tried:
nb.per <- as.numeric(head(summarize(DW, n_distinct(DW$VAR))))
It seems to work, but I'm pretty sure there is a better way to achieve this. Is there?
Thanks!
Since you are already using Spark SQL, a very simple approach is to do the distinct count in the query itself and extract the single value from the collected result:
nb.per <- SparkR::collect(SparkR::sql("select count(distinct VAR) from db.mytable"))[[1]]
Alternatively, using the SparkR DataFrame API:
DW <- SparkR::tableToDF("db.mytable")
nb.per <- SparkR::collect(SparkR::agg(DW, SparkR::countDistinct(SparkR::column("VAR"))))[[1]]
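Another possible variant, offered only as a sketch: count() on a SparkDataFrame returns the number of rows as a plain R number, so selecting the column, keeping the distinct rows, and counting them avoids the collect step. It assumes the same placeholder names as the question (db.mytable, VAR). One caveat: distinct() keeps NULL as a row of its own, whereas count(distinct VAR) and countDistinct ignore NULLs, so the two results can differ by one unless NULLs are filtered out first. The name nb.per.nonnull is just an illustrative variable.

```r
library(SparkR)
sparkR.session()

# db.mytable and VAR are the placeholder names used in the question
DW <- tableToDF("db.mytable")

# count() on a SparkDataFrame returns a plain numeric row count,
# so no collect() or as.numeric() is needed
nb.per <- count(distinct(select(DW, "VAR")))

# Filter out NULLs first to match the semantics of count(distinct VAR)
nb.per.nonnull <- count(distinct(select(filter(DW, isNotNull(DW$VAR)), "VAR")))
```

Both results are ordinary length-one numeric vectors, so class(nb.per) gives "numeric" rather than "data.frame".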