Search code examples
sqlrdatabaseconnectionsummary

Get summary() of mytable from SQL database with R without writing it to my R session


I want to get a table from SQL database with R, I can do that,

df <- sqlQuery(channel, "select * from scheme.mytable1")

But then I want to get the summary function of a table ( I want to see the tables summary)

summary(df)

I can do that with connecting to SQL db and taking the table to my R session (see it as a table in R console as df) and then get the summary(df)

But the thing I want to do is, to see its summary WITHOUT writing it to my R session because the table is too big. (million columns) I can not write it to my R session.

But I want to see the summary of its columns only.

summary(sqlQuery(channel, "select * from scheme.mytable1"))

I want to do something like the above (but the code not work, just to show my purpose)

How can I do that? any package? any function ? deployer?

Thanks!


Solution

  • One thing you could do is leverage dplyr::tbl to create a remote table from the data source, and write your own summary method that does most of the computation on the DB. Here is what that might look like:

    library(dplyr)
    library(tidyr)
    
    # create a connection to table using dplyr, use dummy data 
    # in memory SQLite here,in practice would do df <- tbl(con,  "mytable1")
    con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
    copy_to(con, ggplot2::diamonds, "diamonds")
    diamonds_db <- tbl(con,  "diamonds")
    
    # create custom summary method for the connection
    summary.tbl_SQLiteConnection <- function(object, ...) {
      suppressMessages({
        nums <- object %>% 
          summarise_if(
            is.numeric, 
            # can put whatever functions here
            # these should work for most DBs
            list(Min = min, Max = max, Mean = mean)
          ) %>% 
          collect() %>% 
          pivot_longer(everything()) %>% 
          mutate(name = stringi::stri_reverse(name)) %>% 
          separate(name, into = c("fun", "var"), sep = "_", extra = "merge") %>% 
          mutate_at(vars(var, fun), stringi::stri_reverse) %>% 
          mutate(val = paste0(fun, ": ", round(value, 2))) %>% 
          split(.$var) %>% 
          lapply(`[[`, "val")
        other_nms <- object %>% 
          select_if(~!is.numeric(.x)) %>% 
          head(1) %>% 
          collect() %>% 
          names()
        other <- lapply(other_nms, function(x) {
          x <- ensym(x)
          object %>% 
            group_by(!!x) %>% 
            tally() %>%
            arrange(desc(n)) %>%
            # only show top 10 values
            head(10) %>%
            collect() %>% 
            mutate(tmp = paste0(!!x, ": ", n)) %>% 
            pull(tmp)
        })
      })
      names(other) <- other_nms
      lst <- c(other, nums)
      # format similar to summary.data.frame 
      mat <- matrix(NA_character_, max(sapply(lst, length)), length(lst))
      for (i in seq_along(lst)) {
        for (j in seq_along(mat[, i])) {
          mat[j, i] <- lst[[i]][j]
        }
      }
      colnames(mat) <- names(lst)
      rownames(mat) <- rep("", dim(mat)[1])
      structure(mat, class = "table")
    }
    
    summary(diamonds_db)
    #> cut              color    clarity    carat     depth       price        table       x          y          z         
    #> Fair: 1610       D: 6775  I1: 741    Min: 0.2  Min: 43     Min: 326     Min: 43     Min: 0     Min: 0     Min: 0    
    #> Good: 4906       E: 9797  IF: 1790   Max: 5.01 Max: 79     Max: 18823   Max: 95     Max: 10.74 Max: 58.9  Max: 31.8 
    #> Ideal: 21551     F: 9542  SI1: 13065 Mean: 0.8 Mean: 61.75 Mean: 3932.8 Mean: 57.46 Mean: 5.73 Mean: 5.73 Mean: 3.54
    #> Premium: 13791   G: 11292 SI2: 9194                                                                                 
    #> Very Good: 12082 H: 8304  VS1: 8171                                                                                 
    #>                  I: 5422  VS2: 12258                                                                                
    #>                  J: 2808  VVS1: 3655                                                                                
    #>                           VVS2: 5066 
    

    I tried this on a table too big to fit in memory and it worked. This function doesn't exactly match what's given by summary.data.frame, but should be a start.