I have two data frames: one with two columns and the other with three. The first holds SNP names and their positions. The second holds gene names together with the start and end coordinates of each gene.
I would like to join them on the gene boundaries: if a SNP falls within a gene's boundaries, return it.
dt_snp<-data.table("SNP"=c(paste("SNP",seq(1:10),sep="")),
"BP"=c(1100, 89200, 2500, 33000, 5500, 69500, 12000,8800, 23200, 27000 )) ## SNP data
dt_gene<-data.table("GENE"=c("GENE1","GENE2","GENE3","GENE4","GENE5"),
"START"=c(1000,2100,5000,40000,23000), "END"=c(2000,3000,9000,45000,30000)) ## Gene data
## do a join using data.table
snp_withingenes<-dt_snp[dt_gene, c("SNP","BP","GENE","START","END"), on=.(BP>=START, BP<=END), nomatch=0] # inner join
I get the desired result with this. However, when I run it in an R script that is stored in an R package, I get a warning about the . operator. The warning is:
function_small: no visible global function definition for ‘.’
Undefined global functions or variables:
.
So I would like to use foverlaps instead, but I find it counter-intuitive and am having a hard time getting the desired result with it:
foverlaps(dt_snp,dt_gene, by.x=c("SNP","BP"), by.y=c("GENE","START","END"), nomatch=NA, type="any")
Error in foverlaps(dt_snp, dt_gene, by.x = c("SNP", "BP"), by.y = c("GENE", :
The first 3 columns of y's key must be identical to the columns specified in by.y.
How can I obtain the desired output?
data.table_1.13.0
R v4.0
windows platform
The check from devtools complains about the . operator on R v4.0.
rmarkdown_2.3
devtools_2.3.1
UNIX platform
To expand on my comment, here is the foverlaps option. It requires two columns (an interval start and end) in both data.tables, so it seems suboptimal here:
library(data.table)
dt_snp <- data.table("SNP"=c(paste("SNP",seq(1:10),sep="")),
"BP"=c(1100, 89200, 2500, 33000, 5500, 69500, 12000,8800, 23200, 27000 )) ## SNP data
dt_gene <- data.table("GENE"=c("GENE1","GENE2","GENE3","GENE4","GENE5"),
"START"=c(1000,2100,5000,40000,23000), "END"=c(2000,3000,9000,45000,30000)) ## Gene data
setkey(dt_gene, START, END)
dt_snp[, BP2 := BP]
## do a join using data.table
dt_snp[dt_gene, c("SNP","BP","GENE","START","END"), on=list(BP2 >= START, BP2 <= END), nomatch=0][]
#> SNP BP GENE START END
#> 1: SNP1 1100 GENE1 1000 2000
#> 2: SNP3 2500 GENE2 2100 3000
#> 3: SNP5 5500 GENE3 5000 9000
#> 4: SNP8 8800 GENE3 5000 9000
#> 5: SNP9 23200 GENE5 23000 30000
#> 6: SNP10 27000 GENE5 23000 30000
setkey(dt_snp, BP, BP2)
foverlaps(dt_snp,dt_gene, by.x=c("BP", "BP2"), by.y=c("START","END"), nomatch=NULL, type="any")[, BP2 := NULL][]
#> GENE START END SNP BP
#> 1: GENE1 1000 2000 SNP1 1100
#> 2: GENE2 2100 3000 SNP3 2500
#> 3: GENE3 5000 9000 SNP5 5500
#> 4: GENE3 5000 9000 SNP8 8800
#> 5: GENE5 23000 30000 SNP9 23200
#> 6: GENE5 23000 30000 SNP10 27000
Created on 2020-08-06 by the reprex package (v0.3.0)
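Since the original problem was the check note inside a package, here is a minimal sketch of how both approaches might be wrapped in functions without calling `.()` at all. It assumes that `on` is given as a character vector of join conditions and that the helper column is added with `set()` rather than `:=`. The function names `snps_in_genes()` and `snps_in_genes_foverlaps()` are hypothetical, and the `BP2` helper column simply mirrors the answer above. Both functions work on copies, so the caller's tables are not modified by reference.

```r
library(data.table)

# Example data from the question
dt_snp <- data.table(SNP = paste0("SNP", 1:10),
                     BP  = c(1100, 89200, 2500, 33000, 5500,
                             69500, 12000, 8800, 23200, 27000))
dt_gene <- data.table(GENE  = c("GENE1", "GENE2", "GENE3", "GENE4", "GENE5"),
                      START = c(1000, 2100, 5000, 40000, 23000),
                      END   = c(2000, 3000, 9000, 45000, 30000))

# Hypothetical helper: non-equi join with `on` written as character strings,
# so neither `.()` nor bare column names appear in the function body.
snps_in_genes <- function(snp, gene) {
  snp <- copy(snp)                          # leave the caller's table untouched
  set(snp, j = "BP2", value = snp[["BP"]])  # duplicate BP, as in the answer above
  snp[gene, c("SNP", "BP", "GENE", "START", "END"),
      on = c("BP2>=START", "BP2<=END"), nomatch = NULL]
}

# Hypothetical helper: the same result via foverlaps(), which needs y keyed
# on its interval columns and a start/end pair in x.
snps_in_genes_foverlaps <- function(snp, gene) {
  snp  <- copy(snp)
  gene <- copy(gene)
  set(snp, j = "BP2", value = snp[["BP"]])  # zero-width interval [BP, BP]
  setkeyv(gene, c("START", "END"))          # key of y must match by.y
  res <- foverlaps(snp, gene,
                   by.x = c("BP", "BP2"), by.y = c("START", "END"),
                   type = "any", nomatch = NULL)
  set(res, j = "BP2", value = NULL)         # drop the helper column
  res[]
}

res_join     <- snps_in_genes(dt_snp, dt_gene)
res_foverlap <- snps_in_genes_foverlaps(dt_snp, dt_gene)
```

The key on `dt_gene` is what the foverlaps error in the question is about: `by.y` has to name the key columns of `y`, with the last two being the interval start and end, so `GENE` cannot be part of it.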