I am using sparklyr to run some analysis, but I am also interested in writing raw SQL queries using DBI.
I am able to run the following query
query <- "SELECT col1, FROM mydata WHERE some_condition"
dataframe <- dbGetQuery(spark_connection, query)
but this pulls the data back into R (as a data frame).
What I want instead is to keep the data inside Spark and store it in another Spark DataFrame for further interaction with sparklyr.
Any ideas?
The issue with using DBI is memory. You won't be able to fetch a huge amount of data with it: if your query returns a large result set, it will overwhelm Spark's driver memory and cause out-of-memory errors.
Here is what's happening with sparklyr: DBI runs the SQL command and returns an R data.frame, which means it collects the data to materialize it in a regular R context. Thus, if you only need it to return small datasets, you don't need Spark in the first place.
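To make that concrete, here is a minimal sketch of the collecting behavior; the table mydata and the condition are assumptions carried over from the question:

library(DBI)
library(sparklyr)

sc <- spark_connect(master = "local")

# dbGetQuery() pulls the full result set into the R session
df <- dbGetQuery(sc, "SELECT col1 FROM mydata WHERE some_condition")
class(df)  # "data.frame" -- the rows now live in R memory, not in Spark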
Then DBI isn't the solution for you; you ought to use regular SparkR if you want to stick with R for that.
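For reference, a minimal SparkR sketch (assuming a table or view named mydata is already registered with the session); SparkR's sql() returns a SparkDataFrame whose rows stay inside Spark:

library(SparkR)
sparkR.session()

# sql() returns a SparkDataFrame; nothing is collected into R
sdf <- sql("SELECT col1 FROM mydata WHERE some_condition")
# the rows only move into R if you explicitly call collect(sdf)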
This is an example of how you can run SQL through sparklyr itself:
sc %>%
  spark_session() %>%
  invoke("sql", "SELECT 1") %>%    # run the query on the JVM side; returns a Spark Dataset
  invoke("createTempView", "foo")  # register it as a temp view without collecting to R