I want to get a table from SQL database with R, I can do that,
df <- sqlQuery(channel, "select * from scheme.mytable1")
But then I want to get the summary function of a table ( I want to see the tables summary)
summary(df)
I can do that with connecting to SQL db and taking the table to my R session (see it as a table in R console as df) and then get the summary(df)
But the thing I want to do is, to see its summary WITHOUT writing it to my R session because the table is too big. (million columns) I can not write it to my R session.
But I want to see the summary of its columns only.
summary(sqlQuery(channel, "select * from scheme.mytable1"))
I want to do something like the above (but the code not work, just to show my purpose)
How can I do that? any package? any function ? deployer?
Thanks!
One thing you could do is leverage dplyr::tbl
to create a remote table from the data source, and write your own summary
method that does most of the computation on the DB. Here is what that might look like:
library(dplyr)
library(tidyr)
# create a connection to table using dplyr, use dummy data
# in memory SQLite here,in practice would do df <- tbl(con, "mytable1")
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, ggplot2::diamonds, "diamonds")
diamonds_db <- tbl(con, "diamonds")
# create custom summary method for the connection
summary.tbl_SQLiteConnection <- function(object, ...) {
suppressMessages({
nums <- object %>%
summarise_if(
is.numeric,
# can put whatever functions here
# these should work for most DBs
list(Min = min, Max = max, Mean = mean)
) %>%
collect() %>%
pivot_longer(everything()) %>%
mutate(name = stringi::stri_reverse(name)) %>%
separate(name, into = c("fun", "var"), sep = "_", extra = "merge") %>%
mutate_at(vars(var, fun), stringi::stri_reverse) %>%
mutate(val = paste0(fun, ": ", round(value, 2))) %>%
split(.$var) %>%
lapply(`[[`, "val")
other_nms <- object %>%
select_if(~!is.numeric(.x)) %>%
head(1) %>%
collect() %>%
names()
other <- lapply(other_nms, function(x) {
x <- ensym(x)
object %>%
group_by(!!x) %>%
tally() %>%
arrange(desc(n)) %>%
# only show top 10 values
head(10) %>%
collect() %>%
mutate(tmp = paste0(!!x, ": ", n)) %>%
pull(tmp)
})
})
names(other) <- other_nms
lst <- c(other, nums)
# format similar to summary.data.frame
mat <- matrix(NA_character_, max(sapply(lst, length)), length(lst))
for (i in seq_along(lst)) {
for (j in seq_along(mat[, i])) {
mat[j, i] <- lst[[i]][j]
}
}
colnames(mat) <- names(lst)
rownames(mat) <- rep("", dim(mat)[1])
structure(mat, class = "table")
}
summary(diamonds_db)
#> cut color clarity carat depth price table x y z
#> Fair: 1610 D: 6775 I1: 741 Min: 0.2 Min: 43 Min: 326 Min: 43 Min: 0 Min: 0 Min: 0
#> Good: 4906 E: 9797 IF: 1790 Max: 5.01 Max: 79 Max: 18823 Max: 95 Max: 10.74 Max: 58.9 Max: 31.8
#> Ideal: 21551 F: 9542 SI1: 13065 Mean: 0.8 Mean: 61.75 Mean: 3932.8 Mean: 57.46 Mean: 5.73 Mean: 5.73 Mean: 3.54
#> Premium: 13791 G: 11292 SI2: 9194
#> Very Good: 12082 H: 8304 VS1: 8171
#> I: 5422 VS2: 12258
#> J: 2808 VVS1: 3655
#> VVS2: 5066
I tried this on a table too big to fit in memory and it worked. This function doesn't exactly match what's given by summary.data.frame
, but should be a start.