Tags: r, linear-regression, sparkr, sparklyr

Is there a way to display standard errors with ml_linear_regression in sparklyr?


When running a linear regression using sparklyr, such as:

cached_cars %>%
  ml_linear_regression(mpg ~ .) %>%
  summary()

The results do not include standard errors:

Deviance Residuals:
     Min       1Q   Median       3Q      Max 
-3.47339 -1.37936 -0.06554  1.05105  4.39057 

Coefficients:
(Intercept) cyl_cyl_8.0 cyl_cyl_4.0        disp          hp        drat
16.15953652  3.29774653  1.66030673  0.01391241 -0.04612835  0.02635025
         wt        qsec          vs          am        gear        carb
-3.80624757  0.64695710  1.74738689  2.61726546  0.76402917  0.50935118

R-Squared: 0.8816
Root Mean Squared Error: 2.041

My questions:

  1. Is there a way to display standard errors when running this regression?
  2. Is there a way to cluster standard errors in sparklyr?
  3. I have also been trying to run a linear model with multiple group fixed effects in sparklyr. In R, I have done so with felm from the lfe package. Does anyone have experience doing this in sparklyr?

Solutions using SparkR are also highly appreciated.


Solution

  • I received a useful answer to my first question at community.rstudio.com.

    The answer from yitaoli is the following:

    library(sparklyr)
    
    # This is the version of Spark I ran this example code with,
    # but I think everything that follows should work in other versions of Spark too
    spark_version <- "2.4.4"
    
    sc <- spark_connect(master = "local", version = spark_version)
    
    cached_cars <- copy_to(sc, mtcars)
    model <- cached_cars %>%
      ml_linear_regression(mpg ~ .)
    
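    # Call summary() on the underlying Spark model object, then read the
    # coefficient standard errors from that summary
    coeff_std_errs <- invoke(model$model$.jobj, "summary") %>%
      invoke("coefficientStandardErrors")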
    
    print(coeff_std_errs)
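
The printed standard errors are an unnamed vector, so it can help to line them up with the coefficient names. The following is a minimal sketch, not part of the original answer: `coef_table` is a hypothetical helper, and it assumes Spark's documented ordering, in which the intercept's standard error is the last element when the model is fitted with an intercept, whereas sparklyr lists `(Intercept)` first among the coefficients. It may be worth checking that ordering against your own Spark version before relying on it.

```r
# Hypothetical helper (not part of sparklyr): pair each coefficient with its
# standard error and compute the t-statistic.
# Assumptions: `coefs` is a named numeric vector with "(Intercept)" first
# (as sparklyr prints it); `std_errs` is in Spark's order, intercept LAST.
coef_table <- function(coefs, std_errs) {
  stopifnot(length(coefs) == length(std_errs))
  n <- length(std_errs)
  # Move the intercept's standard error from the end to the front
  std_errs <- c(std_errs[n], std_errs[-n])
  data.frame(
    term      = names(coefs),
    estimate  = unname(coefs),
    std_error = std_errs,
    t_value   = unname(coefs) / std_errs,
    stringsAsFactors = FALSE
  )
}

# Usage with the objects from the answer above (requires a Spark connection):
# coef_table(model$coefficients, unlist(coeff_std_errs))
```

The same Spark summary object also exposes `tValues` and `pValues`, which can be read with `invoke()` in the same way as `coefficientStandardErrors`.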