Tags: r, apache-spark, dplyr, sparkr, sparklyr

Last Rows from a Spark DataFrame (using sparklyr & dplyr)


I'd like to print the last 50 rows of the data frame below, using something like the tail() call shown further down or by selecting a row range with sparklyr, without using arrange or collect; some of my data frames are large and don't have sequential index columns.

library(sparklyr)
library(dplyr)
library(Lahman)

spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")

batting_tbl <- copy_to(sc, Lahman::Batting, "batting"); batting_tbl
batting_tbl %>% count() # Number of rows
    #        n
    #    <dbl>
    # 1 101332

batting_tbl %>% tail(., n = 50)
# Error: tail() is not supported by sql sources

Solution

  • Here's one solution (returns an unsorted tail):

    tbl_df(batting_tbl) %>% slice(101283:101332) # Prints the last 50 rows
    
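    To avoid hardcoding the row indices, the range can also be derived from the row count first. A minimal sketch under the same assumptions as above (it still collects via tbl_df, like the hardcoded version):

    n_rows <- batting_tbl %>% count() %>% collect() %>% pull(n) # Total row count as a plain number
    tbl_df(batting_tbl) %>% slice((n_rows - 49):n_rows)         # Last 50 rows
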

    Here's a second solution (sorts by a descending row index):

    tbl_df(batting_tbl) %>% arrange(-as.numeric(rownames(.))) %>% head(., n = 50)
    

    **Note:** Both of the above require tbl_df, whereas batting_tbl %>% head(., n = 50) does not require collection to an R data.frame and tends to take less time to compute. Thanks to @user6910411 for pointing out that monotonically_increasing_id() or something similar will return a Spark DataFrame instead of the R data.frame returned by collect().

    This returns the tail (or close to it), with the last index values printed first, and avoids collection to an R data.frame:

    sdf_with_unique_id(batting_tbl, id = "id") %>% arrange(-id) # Id column for sorting
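
    To keep only the last 50 rows while staying in Spark, a limit can be chained after the sort. A small sketch building on the line above (head() on a Spark table translates to a SQL LIMIT, so nothing is collected to R):

    sdf_with_unique_id(batting_tbl, id = "id") %>%
      arrange(-id) %>% # Highest ids (i.e., the last rows) first
      head(50)         # Returns a Spark DataFrame with only the tail rows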