Background
This may be my lack of skill showing, but when I work on data manipulation in R using RStudio, I'm fond of clicking into dataframes in the "Environment" pane of the GUI (for me it's in the top-right of the screen) to see how my joins, mutates, etc. are changing the table(s) as I move through my workflow. It acts as a visual sanity check for me; when it comes to tables and dataframes I'm a very visual thinker, and I like to see my results as I code. As an example, I click on this:
And see something like this:
The Problem
Lately, because of a very large dataset (~200 million rows), I've needed to do some of my dplyr work inside sparklyr, using a local instance of Apache Spark to work through some data manipulation. It's working mostly fine, but I lose my ability to have little previews of the data, because Spark dataframe objects look like lists in the Environment pane:
Besides clicking, is there a way I can "preview" my Spark dataframes inside RStudio as I work on them?
What I've tried
So your first thought might be "just use head()" -- and you'd be right! Except that running head(d1, 5) on a local Spark df with 200 million rows takes ... a long time.
Anything I may be missing?
Generally, I believe you need to call collect() on the Spark dataframe to get a regular dataframe you can view. So I would first sample the Spark dataframe with the sparklyr::sdf_sample function, say .001% of the rows (roughly 2,000 rows if there are 200 million), and then collect that sample into a regular dataframe to look at.
samp <- analysis_test %>% sdf_sample(fraction = .00001) %>% collect()
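
If you end up doing this a lot, you could wrap the sample-then-collect step in a small helper so you don't have to work out the fraction by hand each time. This is only a rough sketch: preview_fraction and preview_spark are names I made up, the 2,000-row default is arbitrary, and it assumes you already know (or have counted once, e.g. with sparklyr::sdf_nrow) how many rows the table has. Keep in mind that sdf_sample is a random sample, so the number of rows you get back is only approximately the target.

```r
# Hypothetical helper: pick a sampling fraction that should return roughly
# `target_rows` rows from a table with `n_rows` rows (capped at 1).
preview_fraction <- function(n_rows, target_rows = 2000) {
  stopifnot(n_rows > 0, target_rows > 0)
  min(1, target_rows / n_rows)
}

# Hypothetical helper: sample a Spark dataframe and pull the sample into a
# local dataframe. Requires the sparklyr and dplyr packages and a live
# Spark connection when it is actually called.
preview_spark <- function(sdf, n_rows, target_rows = 2000, seed = NULL) {
  frac <- preview_fraction(n_rows, target_rows)
  sampled <- sparklyr::sdf_sample(sdf, fraction = frac,
                                  replacement = FALSE, seed = seed)
  dplyr::collect(sampled)
}

# Usage, assuming `analysis_test` is a Spark dataframe with ~200 million rows:
# samp <- preview_spark(analysis_test, n_rows = 2e8)
# View(samp)  # opens the sample in the RStudio data viewer
```

With n_rows = 2e8 and the default target this works out to the same fraction as above (.00001), and the collected samp is an ordinary local dataframe, so it shows up in the Environment pane and can be clicked on like any other.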