Search code examples
rdplyrdbplyr

tail() equivalent using dbplyr? (i.e. return last x rows of database table)


Suppose using dbplyr, we have something like

library(dbplyr)

sometable %>%
  head()

then we see the first 6 rows.

But if we try this we see an error

sometable %>%
  tail()
# Error: tail() is not supported by sql sources

which is expected behaviour of dbplyr:

Because you can’t find the last few rows without executing the whole query, you can’t use tail().

Question: how do we do the tail() equivalent in this situation?


Solution

  • In general, the order of SQL queries should never be assumed, as the DBMS may store it in an order that is ideal for indexing or other reasons, and not based on the order you want. Because of that, a common "best practice" for SQL queries is to either (a) assume the data is unordered (and perhaps that the order may change, though I've not seen this in practice); or (b) force ordering in the query.

    From this, consider arranging your data in a descending manner and use head.

    For instance, if I have a table MyTable with a numeric field MyNumber, then

    library(dplyr)
    library(dbplyr)
    tb <- tbl(con, "MyTable")
    tb %>%
      arrange(MyNumber) %>%
      tail() %>%
      sql_render()
    # Error: tail() is not supported by sql sources
    tb %>%
      arrange(MyNumber) %>%
      head() %>%
      sql_render()
    # <SQL> SELECT TOP(6) *
    # FROM "MyTable"
    # ORDER BY "MyNumber"
    tb %>%
      arrange(desc(MyNumber)) %>%
      head() %>%
      sql_render()
    # <SQL> SELECT TOP(6) *
    # FROM "MyTable"
    # ORDER BY "MyNumber" DESC
    

    (This is (obviously) demonstrated on a SQL Server connection, but the premise should work just as well for other DBMS types, they'll just shift from SELECT TOP(6) ... to SELECT ... LIMIT 6 or similar.)