I know how to connect R to Redshift using RPostgreSQL and dplyr, but I am confused about a couple of things:
Have you read the dplyr vignette on databases? You really should read that for much more detail.
You seem to be conflating a connection between R and a database with code execution. A connection allows communication between R and a database: R can send SQL queries to the db, and the db can send results back. Nothing more.
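For example, with RPostgreSQL the connection itself only ferries SQL and results back and forth (a sketch; the host, database name, and credentials below are hypothetical placeholders, not real values):

```r
library(DBI)
library(RPostgreSQL)

# Hypothetical connection details - substitute your cluster's values
con <- dbConnect(PostgreSQL(),
                 host     = "example-cluster.redshift.example.com",
                 port     = 5439,
                 dbname   = "mydb",
                 user     = "me",
                 password = "secret")

# R sends SQL over the connection; the database sends rows back
df <- dbGetQuery(con, "SELECT 1 AS one")

dbDisconnect(con)
```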
`dplyr` attempts to translate some R code to SQL so that the SQL can execute on the database and you get the results without having to write SQL yourself. As the vignette explains, only a limited set of R commands have translations available. The main `dplyr` verbs work, i.e., `select`, `mutate` (though not with grouped data frames for SQLite; it does work with Postgres), `group_by`, `summarize`, `arrange`, and `filter`. Also basic math functions (quoting from the vignette):
- basic math operators: `+`, `-`, `*`, `/`, `%%`, `^`
- math functions: `abs`, `acos`, `acosh`, `asin`, `asinh`, `atan`, `atan2`, `atanh`, `ceiling`, `cos`, `cosh`, `cot`, `coth`, `exp`, `floor`, `log`, `log10`, `round`, `sign`, `sin`, `sinh`, `sqrt`, `tan`, `tanh`
- logical comparisons: `<`, `<=`, `!=`, `>=`, `>`, `==`, `%in%`
- boolean operations: `&`, `&&`, `|`, `||`, `!`, `xor`
- basic aggregations: `mean`, `sum`, `min`, `max`, `sd`, `var`
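To see what that translation looks like in practice, here is a small sketch using an in-memory SQLite source (it assumes the RSQLite package is installed; the same verbs translate for a Postgres/Redshift source):

```r
library(dplyr)

# In-memory SQLite database with a copy of mtcars for illustration
db <- src_sqlite(":memory:", create = TRUE)
mtcars_db <- copy_to(db, mtcars)

# group_by / summarize / filter are all on the "translatable" list
q <- mtcars_db %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) %>%
  filter(avg_mpg > 20)

# show_query() prints the SQL dplyr generated; nothing has run yet
show_query(q)
```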
Anything else, and you will be pulling the data into R and running the R command on it locally. In the case of `head()`, which is not on the list, you are probably pulling the whole table into R and then looking at the first 6 rows.
For Redshift this will be particularly slow for a wide table - Redshift uses columnar storage, so there are strong performance benefits to only pulling the columns you need.
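For example, selecting the columns before collecting keeps the transfer narrow (a sketch; `my_db`, `wide_table`, `id`, and `value` are hypothetical names, with `my_db` standing in for a source such as one made with `src_postgres` pointing at your cluster):

```r
library(dplyr)

# 'my_db' is a hypothetical dplyr source pointing at Redshift
slim <- tbl(my_db, "wide_table") %>%
  select(id, value) %>%   # only these columns leave Redshift
  collect()
```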
Also, as explained in the vignette, `dplyr` will put off the SQL evaluation as long as possible, which improves efficiency. The intended way to bring results into R is `collect`, but your `as_data_frame` sounds like it's working as well.
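A sketch of that lazy behavior, again with a local SQLite source so it can run without a cluster (assumes RSQLite is installed; the pattern is the same for Redshift):

```r
library(dplyr)

db <- src_sqlite(":memory:", create = TRUE)
cars_db <- copy_to(db, mtcars, name = "cars")

# Building the pipeline sends nothing to the database yet -
# 'lazy' is just a recipe for a query
lazy <- cars_db %>%
  filter(cyl == 4) %>%
  arrange(mpg)

# collect() is what finally runs the SQL and returns a local data frame
result <- collect(lazy)
```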