Tags: r, apache-spark, apache-spark-sql, sparklyr

Sparklyr "NoSuchTableException" error after subsetting data


I am new to sparklyr and haven't had any formal training, which will become obvious from this question. I'm also more on the statistician side of the spectrum, which isn't helping. I'm getting an error after subsetting a Spark DataFrame.

Consider the following example:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local[*]")
iris_tbl <- copy_to(sc, iris, name="iris", overwrite=TRUE)

#check column names
colnames(iris_tbl)

#subset so only a few variables remain  
subdf <- iris_tbl %>%
  select(Sepal_Length,Species)

subdf <- spark_dataframe(subdf)

#error happens when I try this operation
spark_session(sc) %>% 
  invoke("table", "subdf")

The error I'm getting is:

Error: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
        at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
        at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)

The traceback continues for several more lines.

I don't understand why I'm getting this error. "subdf" is a Spark DataFrame.


Solution

  • To understand why this doesn't work, you have to understand what happens when you copy_to. Internally, sparklyr registers a temporary table in the Spark metastore and treats it more or less like just another database table. This is why:

    spark_session(sc) %>% invoke("table", "iris")
    

    can find the "iris" table:

    <jobj[32]>
      class org.apache.spark.sql.Dataset
      [Sepal_Length: double, Sepal_Width: double ... 3 more fields]
    
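    You can confirm what the metastore knows about from R. As a quick check (the dplyr listing always works; the catalog call assumes a Spark 2.x session, since 1.x has no catalog API):

    #list tables registered through sparklyr
    src_tbls(sc)

    #or ask the Spark catalog directly
    spark_session(sc) %>%
      invoke("catalog") %>%
      invoke("listTables") %>%
      invoke("show")
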

    subdf, on the other hand, is just a plain local object. It is not registered in the metastore, so it cannot be accessed through the Spark catalog. To make it work you can register the Spark DataFrame as a temporary view:

    subdf <- iris_tbl %>% 
      select(Sepal_Length, Species)
    
    spark_dataframe(subdf) %>%
      invoke("createOrReplaceTempView", "subdf")
    
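    Once the view is registered, the lookup that failed before succeeds:

    spark_session(sc) %>% 
      invoke("table", "subdf")
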

    or collect the result and copy_to it back to Spark, if the data is small enough to be handled by the driver:

    subdf <- iris_tbl %>% 
      select(Sepal_Length, Species) %>% 
      collect() %>%                     #pull the subset to the driver
      copy_to(sc, ., name="subdf", overwrite=TRUE)
    
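    Of the two approaches, the first is usually preferable: createOrReplaceTempView only registers metadata for the existing distributed data, while collect-then-copy_to pulls every row to the driver and serializes it back to Spark.
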

    If you work with Spark 1.x, createOrReplaceTempView should be replaced with registerTempTable.
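
    For completeness, a minimal sketch of the Spark 1.x variant (registerTempTable is the pre-2.0 name for the same operation, deprecated since Spark 2.0):

    #Spark 1.x: same idea, older method name
    spark_dataframe(subdf) %>%
      invoke("registerTempTable", "subdf")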