
foverlaps data.table || error y's key must be identical to the columns specified in by.y


I have two data frames. The first has two columns: SNP names and their positions. The second has three columns: gene names and the start and end coordinates of each gene.
I would like to join them on the gene boundaries: if a SNP falls within a gene's boundaries, return it.

dt_snp<-data.table("SNP"=c(paste("SNP",seq(1:10),sep="")), 
"BP"=c(1100, 89200, 2500, 33000, 5500, 69500, 12000,8800, 23200, 27000  )) ## SNP data

dt_gene<-data.table("GENE"=c("GENE1","GENE2","GENE3","GENE4","GENE5"), 
"START"=c(1000,2100,5000,40000,23000), "END"=c(2000,3000,9000,45000,30000)) ## Gene data

## do a join using data.table
snp_withingenes<-dt_snp[dt_gene, c("SNP","BP","GENE","START","END"), on=.(BP>=START, BP<=END), nomatch=0] # inner join

I get the desired results with it. However, when I run this in an R script that is part of an R package, I get a warning about the . operator:

 function_small: no visible global function definition for ‘.’
  Undefined global functions or variables:
    .

So I'd like to use foverlaps instead, but I'm having a hard time understanding it and getting the desired results. It is counter-intuitive to me.

foverlaps(dt_snp,dt_gene, by.x=c("SNP","BP"), by.y=c("GENE","START","END"), nomatch=NA, type="any")

Error in foverlaps(dt_snp, dt_gene, by.x = c("SNP", "BP"), by.y = c("GENE",  : 
  The first 3 columns of y's key must be identical to the columns specified in by.y.

How can I obtain the desired output?

data.table_1.13.0, R v4.0, Windows platform

The check from devtools complains about the . operator on R v4.0, rmarkdown_2.3, devtools_2.3.1, UNIX platform


Solution

  • To expand on my comment, here is the foverlaps option. It requires two columns (an interval) in both data.tables, so BP has to be duplicated as BP2, which seems suboptimal here:

    library(data.table)
    dt_snp <- data.table("SNP"=c(paste("SNP",seq(1:10),sep="")), 
                       "BP"=c(1100, 89200, 2500, 33000, 5500, 69500, 12000,8800, 23200, 27000  )) ## SNP data
    
    dt_gene <- data.table("GENE"=c("GENE1","GENE2","GENE3","GENE4","GENE5"), 
                        "START"=c(1000,2100,5000,40000,23000), "END"=c(2000,3000,9000,45000,30000)) ## Gene data
    setkey(dt_gene, START, END)
    
    dt_snp[, BP2 := BP]
    ## do a join using data.table
    dt_snp[dt_gene, c("SNP","BP","GENE","START","END"), on=list(BP2 >= START, BP2 <= END), nomatch=0][]
    #>      SNP    BP  GENE START   END
    #> 1:  SNP1  1100 GENE1  1000  2000
    #> 2:  SNP3  2500 GENE2  2100  3000
    #> 3:  SNP5  5500 GENE3  5000  9000
    #> 4:  SNP8  8800 GENE3  5000  9000
    #> 5:  SNP9 23200 GENE5 23000 30000
    #> 6: SNP10 27000 GENE5 23000 30000
    
    setkey(dt_snp, BP, BP2)
    foverlaps(dt_snp,dt_gene, by.x=c("BP", "BP2"), by.y=c("START","END"), nomatch=NULL, type="any")[, BP2 := NULL][]
    #>     GENE START   END   SNP    BP
    #> 1: GENE1  1000  2000  SNP1  1100
    #> 2: GENE2  2100  3000  SNP3  2500
    #> 3: GENE3  5000  9000  SNP5  5500
    #> 4: GENE3  5000  9000  SNP8  8800
    #> 5: GENE5 23000 30000  SNP9 23200
    #> 6: GENE5 23000 30000 SNP10 27000
    

    Created on 2020-08-06 by the reprex package (v0.3.0)
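
If the main goal is just to silence the check note about `.` in package code, a minimal sketch is to pass the join conditions to `on` as strings, a form documented in `?data.table`. This avoids `.()` and bare column names altogether. The helper name `snps_within_genes` is hypothetical (it is not from the question), and `set()` stands in for `BP2 := BP` only so that no unquoted column names appear in the function body.

```r
library(data.table)

dt_snp <- data.table(
  SNP = paste0("SNP", 1:10),
  BP  = c(1100, 89200, 2500, 33000, 5500, 69500, 12000, 8800, 23200, 27000)
)
dt_gene <- data.table(
  GENE  = c("GENE1", "GENE2", "GENE3", "GENE4", "GENE5"),
  START = c(1000, 2100, 5000, 40000, 23000),
  END   = c(2000, 3000, 9000, 45000, 30000)
)

# Hypothetical helper: inner non-equi join of SNP positions onto gene intervals.
snps_within_genes <- function(snp, gene) {
  snp <- copy(snp)                           # do not modify the caller's table by reference
  set(snp, j = "BP2", value = snp[["BP"]])   # duplicate BP so the original value survives the join
  snp[gene,
      c("SNP", "BP", "GENE", "START", "END"),
      on = c("BP2>=START", "BP2<=END"),      # string form of the non-equi conditions
      nomatch = NULL]                        # drop genes without any SNP
}

res <- snps_within_genes(dt_snp, dt_gene)
print(res)
```

This should give the same six SNP/gene pairs as the reprex above. Row order follows `dt_gene`, so it can differ depending on whether `dt_gene` has been keyed.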