Would like to print the last 50 rows of the below dataframe using something like the tail function
below by row range using sparklyr, without arrange
or collect
-- some of my frames are large and don't have sequential columns.
library(sparklyr)
library(dplyr)
library(Lahman)
spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting"); batting_tbl
batting_tbl %>% count # Number of rows
# n
# <dbl>
# 1 101332
batting_tbl %>% tail(., n = 50)
# Error: tail() is not supported by sql sources
Here's one solution (returns unsorted tail):
tbl_df(batting_tbl) %>% slice(101282:101332) # Prints the last 50 rows
Here's a second solution (filters index):
tbl_df(batting_tbl) %>% arrange(-as.numeric(rownames(.))) %>% head(., n = 50)
**Note: Both of the above require tbl_df
, whereas batting_tbl %>% head(., n = 50)
does not require collection to an R data.frame, and tends to take less time to compute. Thanks to @user6910411 for pointing out monotonically_increasing_id()
or something similar will return a Spark Data Frame instead of an R data.frame returned by collect()
.
This returns the tail (or close to it), with the last index values printed first, and avoids collection to an R data.frame:
sdf_with_unique_id(batting_tbl, id = "id") %>% arrange(-id) # Id column for sorting