Search code examples
rrsqlite

RSQLite parametrised query with vectors as parameters


I am new to SQL and its syntax, and I cannot understand how to pass multiple values (e.g. vector or list) to a single parameter in a parametrised query in R using RSQLite.

I have a two tables database (myTCGA) with data coming from RNASeq data. The first (tcga_P) contains expression values (FPKM) of samples for some genes, while (tcgaMeta) contains the metadata information of those samples.

#tcga_P
      FPKM           Sample                 Tissue GeneName
5550 0.0633 TCGA-AB-2803-03A Acute_Myeloid_Leukemia  PLEKHN1
5551 0.2390 TCGA-AB-2805-03A Acute_Myeloid_Leukemia  PLEKHN1
5552 0.0253 TCGA-AB-2806-03A Acute_Myeloid_Leukemia  PLEKHN1
5553 0.0385 TCGA-AB-2807-03A Acute_Myeloid_Leukemia  PLEKHN1
5554 0.0326 TCGA-AB-2808-03A Acute_Myeloid_Leukemia  PLEKHN1
5555 0.2836 TCGA-AB-2810-03A Acute_Myeloid_Leukemia  PLEKHN1

# tcgaMeta (only few columns)
SampleIndex         SampleID        SubjectID Tumor.Type         Sample.Type
  1           0 TCGA-01-0628-11A TCGA-01-0628         OV Solid Tissue Normal
  2           1 TCGA-01-0630-11A TCGA-01-0630         OV Solid Tissue Normal
  3           2 TCGA-01-0631-11A TCGA-01-0631         OV Solid Tissue Normal

I want to extract the expression values from tcga_P of only the samples that belong to specific groups (e.g. all lung samples). To do so, I wrote a query that looks like this

library(DBI)
library(RSQLite)
library(data.table)

myGene <- "PLEKHN1"
myTissue <- "lung"
myCancer <- "Lung Adenocarcinoma"
selectedSamples <- dbGetQuery(myTCGA, 
     "SELECT A.*
     FROM tcga_P A 
     WHERE A.GeneName = $gene AND
           A.Sample in (SELECT B.SampleID FROM tcgaMeta B 
                WHERE B.Tissue = $tissue AND 
                      B.`Disease.TCGA.` = $cancer )
     ",param = list(gene=myGene,tissue=myTissue,cancer=myCancer))
 # from long to wide
 selectedSamplesWide <- dcast(selectedSamples,GeneName~Sample, value.var = "FPKM",fun.aggregate = sum)

This query correctly returns the values only if I supply one gene at the time, but it fails if I try to extract the values from multiple genes (myGene <- c("PLEKHN1","PSMD12"))

Error in result_bind(res@ptr, params) : 
Parameter 2 does not have length 2.

I could loop (sapply) through the genes in the vector, make the sql call one gene at the time and bind the results together, but I would like to do everything inside the sql call.

So far I tried with

WHERE A.GeneName IN ($gene)
WHERE A.GeneName IN (SELECT C.GeneName FROM $gene C)

I also tried to convert myGene to a data.frame and treat the genes as a column. It is superfluous to say that nothing worked.

What am I missing? how are the parameters passed to param = list()?


Solution

  • In SQL, the WHERE clause with equality, =, expects one value where you are attempting to evaluate expression to two values. However, the WHERE with IN allows for multiple values:

    WHERE A.GeneName IN ('PLEKHN1', 'PSMD12', ...)
    

    For an open-ended number of values, consider dynamically creating the prepared statement with paste + collapse and binding parameter values with setNames and as.list:

    myGene <- c("PLEKHN1", "PSMD12")
    myTissue <- "lung"
    myCancer <- "Lung Adenocarcinoma"
    
    myPlaceHolders <- paste0("$gene", seq_along(myGene))
    
    sql <- paste0("SELECT A.*
                   FROM tcga_P A 
                   WHERE A.GeneName IN (", paste(myPlaceHolders, collapse=", "), ") 
                     AND A.Sample in (SELECT B.SampleID 
                                      FROM tcgaMeta B 
                                      WHERE B.Tissue = $tissue 
                                        AND B.`Disease.TCGA.` = $cancer)
                 ")
    
    myGeneParams <- as.list(setNames(myGene, gsub("\\$", "", myPlaceHolders)))
    paramList <- c(myGeneParams, tissue=myTissue, cancer=myCancer)
    
    selectedSamples <- dbGetQuery(myTCGA, sql, param = myParamList)
    

    Rextester demo (for prepared statement and parameter output not query run)