Tags: sql, r, apache-spark, sparklyr, r-dbi

DBI/Spark: how to store the result in a Spark Dataframe?


I am using sparklyr to run some analysis, but I am also interested in writing raw SQL queries using DBI.

I am able to run the following query:

query <- "SELECT col1 FROM mydata WHERE some_condition"
dataframe <- dbGetQuery(spark_connection, query)

but this collects the result into R as a regular data frame.

What I want instead is to keep the data inside Spark and store it in another Spark DataFrame for further interaction with sparklyr.

Any ideas?


Solution

  • The issue with using DBI is memory. You won't be able to fetch a huge amount of data with it: if your query returns a large result set, it will overwhelm the Spark driver's memory and cause out-of-memory errors.

    What's happening with sparklyr is the following: DBI runs the SQL command and returns an R data frame, which means it collects the data in order to materialize it in a regular R context.

    Thus it is only suitable when you want to return a small dataset, in which case you arguably don't need Spark at all.

    So DBI isn't the solution for you; you ought to use the Spark API itself, for example through regular SparkR, if you want to stick with R for that.

    This is an example of how you can run the SQL in sparklyr and keep the result inside Spark (picking the result back up with dplyr is shown below):

    sc %>% spark_session() %>%
       invoke("sql", "SELECT 1") %>%      # run the query on the JVM side; returns a Spark DataFrame reference
       invoke("createTempView", "foo")    # register it as a temp view instead of collecting it into R