First post here, so I hope I can explain myself clearly.
I need to cross-reference two data frames: for each chromosome position given in one data frame, I want to check whether it falls inside a range (start/end) given in the other, and add a new column with the gene found in that range.
"genes" is the data frame with the coordinates (start/end) that define the ranges:
head(genes)
# A tibble: 6 x 9
  chr   source         type      start       end strand gene_id         symbol        gene_biotype
  <chr> <chr>          <chr>     <int>     <int> <chr>  <chr>           <chr>         <chr>
1 2     pseudogene     gene  143300987 143301544 +      ENSG00000228134 AC092578.1    pseudogene
2 2     pseudogene     gene  143611664 143613567 +      ENSG00000229781 AC013444.1    pseudogene
3 2     protein_coding gene  143635067 143799890 +      ENSG00000115919 KYNU          protein_coding
4 2     pseudogene     gene  143704869 143705655 -      ENSG00000270390 RP11-470B22.1 pseudogene
5 2     miRNA          gene  143763269 143763360 -      ENSG00000221169 AC013444.2    miRNA
6 2     protein_coding gene  143848931 144525921 +      ENSG00000075884 ARHGAP15      protein_coding
The other data frame (x) is:
  chr_a   point A
1     2 143301002
2     2 143625061
3     2 143700941
4     2 143811317
5     2 144127323
6     2 144224689
Basically, I need to find whether "point A" falls within the "start"/"end" range of any row in genes, and which gene symbol is associated with that range.
I tried the following:
x$geneA <- ifelse(sapply(x$`point A`, function(g)
any(genes$start >= g & genes$end <=g)), genes$symbol, NA)
but the results I get are not in line with the genomic coordinates.
Hope someone can help me! Thx!
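Does this work?
In your attempt the comparison is reversed: start >= g & end <= g is only true when start, end and the point all coincide, so a point inside a range never matches. On top of that, ifelse() returns genes$symbol by row position, not the symbol of the row that matched.
The version below tests g >= start & g <= end and takes the first matching symbol. I'm assuming that each point matches at most one gene symbol, and that all rows are on the same chromosome, since chr is not compared.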
library(dplyr)  # provides filter()
x$geneA <- sapply(x$`point A`,
                  function(g) filter(genes, g >= start & g <= end)$symbol[1])
Result:
x
# A tibble: 6 x 3
  chr_a `point A` geneA
  <int>     <int> <chr>
1     2 143301002 AC092578.1
2     2 143625061 NA
3     2 143700941 KYNU
4     2 143811317 NA
5     2 144127323 ARHGAP15
6     2 144224689 ARHGAP15
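
If your real data spans several chromosomes, or a point can sit inside overlapping genes, a variant along these lines might be safer. This is only a sketch in base R under a few assumptions: find_genes is a hypothetical helper name, the two data frames are rebuilt here from the rows shown in the question (only the columns needed), and chr_a is converted to character so it can be compared with chr.

```r
# Toy copies of the two tables from the question (only the columns used here).
genes <- data.frame(
  chr    = c("2", "2", "2", "2", "2", "2"),
  start  = c(143300987L, 143611664L, 143635067L, 143704869L, 143763269L, 143848931L),
  end    = c(143301544L, 143613567L, 143799890L, 143705655L, 143763360L, 144525921L),
  symbol = c("AC092578.1", "AC013444.1", "KYNU", "RP11-470B22.1", "AC013444.2", "ARHGAP15"),
  stringsAsFactors = FALSE
)
x <- data.frame(
  chr_a     = c(2L, 2L, 2L, 2L, 2L, 2L),
  `point A` = c(143301002L, 143625061L, 143700941L, 143811317L, 144127323L, 144224689L),
  check.names = FALSE  # keep the space in the column name "point A"
)

# Hypothetical helper: collect every gene whose range contains the point on the
# same chromosome; several hits are joined with ";", no hit gives NA.
find_genes <- function(chrom, pos) {
  hit <- genes$chr == as.character(chrom) & genes$start <= pos & genes$end >= pos
  if (any(hit)) paste(genes$symbol[hit], collapse = ";") else NA_character_
}

# Apply the helper row by row over chromosome and position.
x$geneA <- mapply(find_genes, x$chr_a, x$`point A`)
x
```

On the six rows above this gives the same geneA column as the result shown earlier. For large tables a row-by-row scan like this can get slow, in which case an interval-overlap tool would probably be a better fit.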