Search code examples
rdplyramazon-redshiftrpostgresql

what is the theory of connecting R to Redshift using RPostgreSQL and dplyr


I just know how to connect R to Redshift using RPostgreSQL and dplyr. But I am confused about a couple of things:

  1. When I want to show something on the screen, like head(a) where a is a result retrieving from Redshift, it is gonna be really slow. However, if I just picked a couple of fields, let's say just 4 columns, then it is much faster. So how does R connect Redshift?
  2. If I want to use some models for example Random forest or generalized linear, it is necessary to use as.data.frame to transfer the result to a data frame on my local machine? I have tried. I must do this before I use ggplot2 to draw charts.

Solution

  • Have you read the dplyr vignette on databases? You really should read that for much more detail.

    You seem to be confounding a connection between R and a database and code execution. A connection allows communication between R and a database - R can send SQL queries to the db and the db can send results back. Nothing more.

    dplyr attempts to translate some R code to SQL code so that the SQL can execute on the database and you get the results without having to write SQL yourself. As the vignette explains, there are very few R commands that have translation available. The main dplyr verbs work, i.e., select, mutate (though not with grouped data frames for SQLite, however this is available with postgres), group_by, summarize, arrange, filter. Also basic math functions (quoting from the vignette):

    • basic math operators: +, -, *, /, %%, ^
    • math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
    • logical comparisons: <, <=, !=, >=, >, ==, %in%
    • boolean operations: &, &&, |, ||, !, xor
    • basic aggregations: mean, sum, min, max, sd, var

    Anything else, and you will pulling the data into R and running the R command on it. In the case of head(), which is not on the list, you are probably pulling the whole table into R and then looking at the first 6 rows.

    For Redshift this will be particularly slow for a wide table - Redshift uses columnar storage, so there are strong performance benefits to only pulling the columns you need.

    Also, as explained in the vignette, dplyr will put off the SQL evaluation as long as possible, which improves efficiency. The intended way to bring results to R is collect, but your as_data_frame sounds like it's working as well.