Tags: r, azure-databricks, sparkr

How to store a SparkR result in an R object?


I'm still new to the world of Azure Databricks, and the use of SparkR remains very obscure to me, even for very simple tasks...

It took me a very long time to find out how to count distinct values, and I'm not sure it's the right way to go:

library(SparkR)
sparkR.session()

DW <- sql("select * from db.mytable")
nb.per <- head(summarize(DW, n_distinct(DW$VAR)))

I thought I had found it, but nb.per is not a single value: it is still a data.frame...

class(nb.per)
[1] "data.frame"

I tried:

nb.per <- as.numeric(head(summarize(DW, n_distinct(DW$VAR))))

It seems OK, but I'm pretty sure there is a better way to achieve this?

Thanks!


Solution

  • Since you are using Spark SQL anyway, a very simple approach is to run the distinct count in SQL, collect the result into R, and extract the single value:

    nb.per <- SparkR::collect(SparkR::sql("select count(distinct VAR) from db.mytable"))[[1]]

    The same thing can be done with the SparkR DataFrame API:

    DW <- SparkR::tableToDF("db.mytable")
    nb.per <- SparkR::collect(SparkR::agg(DW, SparkR::countDistinct(SparkR::column("VAR"))))[[1]]

    Either way, collect() returns a plain R data.frame and [[1]] pulls out its only column as an ordinary R value; a sketch applying the same pattern to the question's summarize code follows below.
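
For completeness, the same collect-and-extract pattern can also be applied to the summarize/n_distinct code from the question. This is a minimal sketch, assuming the db.mytable table and the VAR column named above:

    library(SparkR)
    sparkR.session()

    # Sketch: the table and column names (db.mytable, VAR) are taken from the question
    DW <- tableToDF("db.mytable")

    # summarize() returns a SparkDataFrame; collect() brings it into R as a
    # data.frame, and [[1]] extracts its single column as a plain numeric vector
    nb.per <- collect(summarize(DW, n_distinct(DW$VAR)))[[1]]

    class(nb.per)   # should now be "numeric", no longer "data.frame"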