Tags: sql, r, apache-spark, sparklyr, r-dbi

DBI/Spark: how to store the result in a Spark Dataframe?


I am using sparklyr to run some analysis, but I am also interested in writing raw SQL queries using DBI.

I am able to run the following query:

query <- "SELECT col1 FROM mydata WHERE some_condition"
dataframe <- dbGetQuery(spark_connection, query)

but this collects the result into R as a regular data frame.

What I want instead is to keep the data inside Spark and store it in another Spark DataFrame for further interaction with sparklyr.

Any ideas?


Solution

  • The issue with using DBI is memory. You won't be able to fetch a huge amount of data with it: if your query returns a large result set, it will overwhelm the Spark driver's memory and cause out-of-memory errors.

    What's happening with sparklyr is the following: DBI runs the SQL command and returns an R data frame, which means it collects the data in order to materialize it in a regular R context.

    Thus it is only suitable when you want to return a small dataset, in which case you arguably don't need Spark at all.

    So DBI isn't the solution for you; you ought to use the Spark API itself, for example through regular SparkR, if you want to stick with R for that.

    This is an example of how you can run the SQL in sparklyr and keep the result inside Spark (picking the result back up with dplyr is shown below):

    sc %>% spark_session() %>%
       invoke("sql", "SELECT 1") %>%      # run the query on the JVM side; returns a Spark DataFrame reference
       invoke("createTempView", "foo")    # register it as a temp view instead of collecting it into R