Tags: r, apache-spark, user-interface, rstudio, sparklyr

In RStudio, can I visually preview Spark Dataframes in the GUI like I can with normal R dataframes?


Background

This may be my lack of skill showing, but as I work on data manipulation in R using RStudio, I'm fond of clicking on dataframes in the "Environment" pane (for me it's in the top-right of the screen) to see how my joins, mutates, etc. are changing the table(s) as I move through my workflow. It acts as a visual sanity check: when it comes to tables and dataframes I'm a very visual thinker, and I like to see my results as I code. As an example, I click on this:

[screenshot: a dataframe listed in the Environment pane]

And see something like this:

[screenshot: the dataframe opened in RStudio's data viewer]
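
For what it's worth, that click is just a shortcut for RStudio's data viewer; a rough code equivalent (using a built-in dataframe purely for illustration) would be:

    # Opens the same spreadsheet-style viewer that clicking in the Environment pane does
    View(mtcars)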

The Problem

Lately, because of a very large dataset (~200m rows), I've needed to do some of my dplyr work in sparklyr, using a local instance of Apache Spark to work through the data manipulation. It's working mostly fine, but I lose my ability to get those little previews of the data, because Spark dataframe objects show up as lists in the Environment pane:

[screenshot: the Spark dataframe appearing as a list in the Environment pane]
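
In code terms, here is roughly what the pane is reacting to (d1 is a stand-in name for one of those Spark table references):

    # A sparklyr table is a lazy reference (a connection plus a query), not local data,
    # which is why the Environment pane shows it as a list rather than a dataframe
    class(d1)
    # typically something like: "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"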

Besides clicking around in that list, is there a way I can "preview" my Spark dataframes inside RStudio as I work on them?

What I've tried

So your first thought might be "just use head()" -- and you'd be right! Except that running head(d1, 5) on a local Spark df with 200 million rows takes ... a long time.
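
Concretely, the attempt looked something like this (d1 being the Spark table reference from above):

    head(d1, 5)                  # still lazy: a query with a LIMIT, not local data
    head(d1, 5) %>% collect()    # executes the query and pulls the rows into R -- slow here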

Anything I may be missing?


Solution

  • Generally, I believe you need to call collect() on the Spark dataframe to bring data back into R. So I would first sample the Spark dataframe, say 0.001% of the rows (about 2,000 rows if there are 200 million), with the sparklyr::sdf_sample function, and then collect that sample into a regular dataframe to look at:

    # pull roughly 0.001% of the rows back into R as an ordinary dataframe
    samp <- analysis_test %>% sdf_sample(fraction = 0.00001) %>% collect()
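
If it helps, here's a slightly fuller sketch of the same idea, end to end; the connection setup and the table name analysis_test are assumptions for illustration:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # analysis_test stands in for the large Spark table being manipulated
    analysis_test <- tbl(sc, "analysis_test")

    # Sample a tiny fraction, pull it back into R, then inspect it visually
    samp <- analysis_test %>%
      sdf_sample(fraction = 0.00001, replacement = FALSE, seed = 42) %>%
      collect()

    View(samp)   # or click samp in the Environment pane, as with any regular dataframe

Since samp is an ordinary R dataframe after collect(), the usual Environment pane preview works on it.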