I try to make the mean, max, min and sd extraction inside 5-95 quantiles in a BigQuery server, but dplyr
verbs don't work and the output error is: x Syntax error: Expected ")" but got keyword AS at [1:117] [invalidQuery]
In my example:
library(tidyverse)
library(bigrquery)
library(DBI)
library(googleAuthR)
bq_auth(email = "[email protected]")
projectid<-'helical-client-323918'
datasetid<-'spectra_calibration'
bq_conn <- dbConnect(bigquery(),
project = projectid,
dataset = datasetid,
use_legacy_sql = FALSE
)
bigrquery::dbListTables(bq_conn) # List all the tables in BigQuery data set
#[1] "CMPC"
#
raw_spectra_CMPC <- dplyr::tbl(bq_conn,
"CMPC") # connects to a table
glimpse(raw_spectra_CMPC)
Rows: ??
Columns: 27
Database: BigQueryConnection
$ x <dbl> -52.5502, -52.5501, -52.5501, -52.5501, -52.5501, -52.5500, -52.5500, -52.5500, -52.5500, -52.5500, -5~
$ y <dbl> -30.8295, -30.8297, -30.8296, -30.8295, -30.8294, -30.8298, -30.8297, -30.8296, -30.8295, -30.8294, -3~
$ stand <chr> "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A~
$ date <chr> "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "201~
$ B2 <dbl> 213, 205, 181, 207, 216, 205, 165, 161, 173, 182, 181, 259, 227, 190, 153, 147, 160, 164, 194, 210, 18~
$ B3 <dbl> 361.0, 362.0, 346.0, 352.0, 369.0, 330.0, 290.0, 326.0, 334.0, 332.0, 325.0, 375.0, 352.0, 307.0, 281.~
$ B4 <dbl> 227.0, 233.0, 198.0, 207.0, 209.0, 227.0, 178.0, 164.0, 180.0, 207.0, 209.0, 267.0, 269.0, 194.0, 163.~
$ B8 <dbl> 3033.0, 3307.0, 3322.0, 3232.0, 3241.0, 3065.0, 3306.0, 3422.0, 3427.0, 3392.0, 3165.0, 3206.0, 2984.0~
$ NDVI <dbl> 0.86074, 0.86836, 0.88750, 0.87962, 0.87884, 0.86209, 0.89782, 0.90853, 0.90019, 0.88497, 0.87611, 0.8~
$ SAVI <dbl> 4549.379, 4960.386, 4982.905, 4847.897, 4861.397, 4597.380, 4958.915, 5132.925, 5140.417, 5087.903, 47~
$ SIPI <dbl> 1.00499, 1.00911, 1.00544, 1.00000, 0.99769, 1.00775, 1.00416, 1.00092, 1.00216, 1.00785, 1.00947, 1.0~
$ SR <dbl> 13.36123, 14.19313, 16.77778, 15.61353, 15.50718, 13.50220, 18.57303, 20.86585, 19.03889, 16.38647, 15~
$ RGI <dbl> 0.62881, 0.64365, 0.57225, 0.58807, 0.56640, 0.68788, 0.61379, 0.50307, 0.53892, 0.62349, 0.64308, 0.7~
$ TVI <int> 173720, 189600, 193360, 187300, 188320, 174400, 192160, 201960, 200980, 196100, 182000, 180660, 166220~
$ MSR <dbl> 3.65530, 3.76738, 4.09607, 3.95140, 3.93792, 3.67453, 4.30964, 4.56792, 4.36336, 4.04802, 3.89147, 3.4~
$ PRI <dbl> -0.25784, -0.27690, -0.31309, -0.25939, -0.26154, -0.23364, -0.27473, -0.33881, -0.31755, -0.29183, -0~
$ GNDVI <dbl> 0.78727, 0.80267, 0.81134, 0.80357, 0.79557, 0.80560, 0.83871, 0.82604, 0.82239, 0.82170, 0.81375, 0.7~
$ PSRI <dbl> -0.04418, -0.03901, -0.04455, -0.04486, -0.04937, -0.03361, -0.03388, -0.04734, -0.04494, -0.03685, -0~
$ GCI <dbl> 7.40166, 8.13536, 8.60116, 8.18182, 7.78320, 8.28788, 10.40000, 9.49693, 9.26048, 9.21687, 8.73846, 7.~
$ ID_PROJETO <int> 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 24~
$ PROJETO <chr> "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "AB~
$ CD_TALHAO <chr> "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A"~
$ DATA_PLANT <chr> "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "200~
$ ESPECIE <chr> "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SA~
$ ESPAC <chr> "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2~
$ AGE_1 <dbl> 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, ~
$ AGE <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11~
But when I try to used my function:
quantis <- function(x) x > quantile(x, 0.05) & x < quantile(x, 0.95)
data_Q95 <- raw_spectra_CMPC %>%
filter (across(B2:GCI, .fns = quantis)) %>%
group_by(AGE, ESPAC) %>%
summarise(across(B2:GCI, list(mean = ~mean(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE), min = ~ min(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))))
data_Q95
# <error/rlang_error>
# Job 'fc-vm-v1.job_41KNdnsYP_gws6avquMVeqjOC6_V.US' failed
# x Syntax error: Expected ")" but got keyword AS at [1:117] [invalidQuery]
# Backtrace:
# 1. (function (x, ...) ...
# 2. dbplyr:::print.tbl_sql(x)
# 7. pillar:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
# 8. pillar:::format_tbl(...)
# 9. pillar::tbl_format_setup(...)
# 11. pillar:::tbl_format_setup.tbl(...)
# 12. pillar:::df_head(x, n + 1)
# 14. dbplyr:::as.data.frame.tbl_sql(head(x, n))
# 17. bigrquery:::collect.tbl_BigQueryConnection(x, n = n)
# 18. bigrquery::bq_dataset_query(...)
# 19. bigrquery::bq_job_wait(job, quiet = quiet)
# Run `rlang::last_trace()` to see the full context.
Please, any help with a good way for me to make these extractions? Is necessary to use SQL query too? I make some tests with my function in a *csv file before export to the BigQuery and works very well:
raw_spectra_CMPC_csv <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/my_ds_CSV.csv")
quantis <- function(x) x > quantile(x, 0.05) & x < quantile(x, 0.95)
data_Q95 <- raw_spectra_CMPC_csv %>%
filter (across(B2:GCI, .fns = quantis)) %>%
group_by(AGE, ESPAC) %>%
summarise(across(B2:GCI, list(mean = ~mean(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE), min = ~ min(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))))
data_Q95
# # A tibble: 7 x 62
# # Groups: AGE [7]
# AGE ESPAC B2_mean B2_max B2_min B2_sd B3_mean B3_max B3_min B3_sd B4_mean B4_max B4_min B4_sd B8_mean B8_max B8_min
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 -2 4X1.85 125. 175 75 14.2 236. 312 199 18.6 129. 345 108. 12.6 3590. 4052 2189
# 2 -1 4X1.85 153. 300 67 34.0 282. 436 198. 34.3 166. 379 108. 30.7 3346. 4059 2208
# 3 0 4X1.85 419. 928. 71 274. 610. 1208 199 328. 730. 1668 109 535. 2933. 4069 2095
# 4 1 4X1.85 344. 683 129 83.4 510. 944 286 97.0 544. 1180 256 129. 2871. 3451 2115
# 5 11 3.5x2.14 137. 259 70 29.8 276. 467 199 38.4 160. 361 109 26.4 3665. 4069 2688
# 6 12 3.5x2.14 150. 298 67.5 23.6 267. 485 200 32.1 169. 421 109 26.2 3354. 4067 2293
# 7 13 3.5x2.14 130. 302 70 35.3 247. 482 200 30.0 144. 465 111 29.7 3833. 4069 3116
# # ... with 45 more variables: B8_sd <dbl>, NDVI_mean <dbl>, NDVI_max <dbl>, NDVI_min <dbl>, NDVI_sd <dbl>,
# # SAVI_mean <dbl>, SAVI_max <dbl>, SAVI_min <dbl>, SAVI_sd <dbl>, SIPI_mean <dbl>, SIPI_max <dbl>, SIPI_min <dbl>,
# # SIPI_sd <dbl>, SR_mean <dbl>, SR_max <dbl>, SR_min <dbl>, SR_sd <dbl>, RGI_mean <dbl>, RGI_max <dbl>, RGI_min <dbl>,
# # RGI_sd <dbl>, TVI_mean <dbl>, TVI_max <dbl>, TVI_min <dbl>, TVI_sd <dbl>, MSR_mean <dbl>, MSR_max <dbl>,
# # MSR_min <dbl>, MSR_sd <dbl>, PRI_mean <dbl>, PRI_max <dbl>, PRI_min <dbl>, PRI_sd <dbl>, GNDVI_mean <dbl>,
# # GNDVI_max <dbl>, GNDVI_min <dbl>, GNDVI_sd <dbl>, PSRI_mean <dbl>, PSRI_max <dbl>, PSRI_min <dbl>, PSRI_sd <dbl>,
Thanks in advance!
I tried reproducing your code and I noticed that R code is not properly translated to BQ query.
sd(.x)
since by default BQ (STTD_DEV) ignores the null values.quantis
is not created in BQ, thus it does not do its job and errors out. I'm not sure if R supports use of functions to BQ.What I could suggest is instead of using native R operations use SQL statements to prevent incorrect translation of R to BigQuery operations. You can try creating a user defined function (your quantis
function) in BQ. In your select statement perform mean, max and sd on your fields. Filter using your UDF(quantis), group by age and espac. You can also try creating a VIEW that achieves most of your goal including WHERE clause on quantile. You can refer to this document for reference on how to use BigQuery in R.