Tags: r, apache-spark, dataframe, apache-spark-sql, sparkr

SparkR - Error in as.double(x) : cannot coerce type 'S4' to vector of type 'double'


I want to get some descriptive statistics on my data frame:

# Initialize SparkR Contexts
library(SparkR)                      # Load library
sc <- sparkR.init(master="local[4]") # Initialize Spark Context
sqlContext <- sparkRSQL.init(sc)     # Initialize SQL Context

# Load data
df <- loadDF(sqlContext, "/outputs/merged.parquet") # Load data into Data Frame

# Filter 
df_t1 <- select(filter(df, df$t == 1 & df$totalUsers > 0 & isNotNull(df$domain)), "*")

avg_df <- collect(agg(groupBy(df_t1, "domain"), AVG=avg(df_t1$totalUsers), STD=sd(df_t1$totalUsers, na.rm = FALSE)))
head(avg_df)

I am getting this error:

Error in as.double(x) : 
  cannot coerce type 'S4' to vector of type 'double'

which is produced by sd(). I tried var() instead and got Error: is.atomic(x) is not TRUE. With avg() alone I get no error.

My question is different from this one because I am not using those packages, and from reading this I understand that for some reason my df_t1$totalUsers is of type S4 instead of a vector of doubles, so I tried casting it, with no effect:

avg_df <- collect(agg(groupBy(df_t1, "domain"),AVG=avg(df_t1$totalUsers), STD=sd(cast(df_t1$totalUsers, "double"),na.rm = FALSE)))

Thoughts?

Edit: The schema is

> printSchema(df_t1)
root
 |-- created: integer (nullable = true)
 |-- firstItem: integer (nullable = true)
 |-- domain: string (nullable = true)
 |-- t: integer (nullable = true)
 |-- groupId: string (nullable = true)
 |-- email: integer (nullable = true)
 |-- chat: integer (nullable = true)

and my version of Spark is 1.5.2


Solution

  • You're using Spark 1.5, which doesn't provide advanced statistical summaries, and you cannot use standard R functions such as sd() or var() when operating on a Spark DataFrame. avg() works because it is actually a Spark SQL function available in Spark 1.5.

    Additional statistical summaries were introduced in Spark 1.6, including methods to compute standard deviation (sd, stddev, stddev_samp and stddev_pop) and variance (var, variance, var_samp, var_pop). You can of course still compute the standard deviation in Spark 1.5 using the well-known formula, as shown in Calculate the standard deviation of grouped data in a Spark DataFrame; a sketch of that approach follows below.
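
    A minimal sketch of that workaround, assuming the df_t1 frame and totalUsers column from the question: aggregate the count, the mean, and the mean of squares per domain with Spark SQL functions (all available in 1.5), collect the small per-group summary, and finish the sample standard deviation in plain R. (On Spark 1.6+ you could instead write STD = stddev(df_t1$totalUsers) directly inside agg.)

    # Aggregate the building blocks of the variance formula on the cluster
    grouped <- groupBy(df_t1, "domain")
    stats <- agg(grouped,
                 N      = count(df_t1$totalUsers),
                 AVG    = avg(df_t1$totalUsers),
                 AVG_SQ = avg(df_t1$totalUsers * df_t1$totalUsers))

    # One row per domain, so this is small enough to collect to the driver
    stats_df <- collect(stats)

    # Sample standard deviation: s^2 = n/(n-1) * (E[x^2] - E[x]^2)
    # (groups with N == 1 yield NaN here, analogous to sd() returning NA)
    stats_df$STD <- sqrt(stats_df$N / (stats_df$N - 1) *
                         (stats_df$AVG_SQ - stats_df$AVG^2))
    head(stats_df)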