I have built a function with apply
and outer
that is working great. However, one issue is that I cannot figure out how to add a condition statement without breaking it.
The function currently is:
df$match <- apply(outer(df$c2, df$c2, function(x, y) abs(x - y) < 1e7) &
diag(nrow(df)) == 0,
MARGIN = 1,
function(x) paste(df$c3[x], collapse = ", "))
Basically, it looks to see if a numeric value in c2
of one row within a certain threshold (1e7) of any other row in df
and then tells me which row(s) it matches by returned a specific row ID I have stored in c3
. However, I want to modify this to only compare rows with the same numeric value in another column: c1
.
Here is what the data looks like:
c1 c2 c3 match
1 52426577 chr1.52426577_T chr2.43668478_G, chr2.47207905_G, chr2.47959251_C, chr2.49606475_C
1 108023890 chr1.108023890_A
1 129776943 chr1.129776943_T
2 39943710 chr2.39943710_C chr2.43668478_G, chr2.47207905_G, chr2.47959251_C,
2 43668478 chr2.43668478_G chr1.52426577_T, chr2.39943710_C, chr2.47207905_G, chr2.47959251_C
2 47207905 chr2.47207905_G chr1.52426577_T, chr2.39943710_C, chr2.43668478_G, chr2.47959251_C
2 47959251 chr2.47959251_C chr1.52426577_T, chr2.39943710_C, chr2.43668478_G, chr2.47207905_G
I've tried this but it doesn't work. The error I get is: attempt to apply non-function
df$match <- apply(outer(df$c2, df$c2, function(x, y) if(df$c1(x) == df$c1(y)) abs(x - y) < 1e7) &
diag(nrow(df)) == 0,
MARGIN = 1,
function(x) paste(df$c3[x], collapse = ", "))
This is my desired result:
c1 c2 c3 match
1 52426577 chr1.52426577_T
1 108023890 chr1.108023890_A
1 129776943 chr1.129776943_T
2 39943710 chr2.39943710_C chr2.43668478_G, chr2.47207905_G, chr2.47959251_C
2 43668478 chr2.43668478_G chr2.39943710_C, chr2.47207905_G, chr2.47959251_C
2 47207905 chr2.47207905_G chr2.39943710_C, chr2.43668478_G, chr2.47959251_C
2 47959251 chr2.47959251_C chr2.39943710_C, chr2.43668478_G, chr2.47207905_G
I've also tried grouping in dplyr to no avail:
df$match <- sapply(df$c3, function(x){
df %>%
group_by(c1) %>%
filter(abs(c2 - c2[c3 == x]) < 1e7,
c3 != x) %>%
pull(c3) %>%
paste0(collapse = ',')
})
Error in `filter()`:
! Problem while computing `..1 = abs(d2 - d2[d3 == x]) < 1e+07`.
✖ Input `..1` must be of size 17 or 1, not size 0.
ℹ The error occurred in group 2: d1 = 2.
You can apply your function by group using dplyr
:
library(dplyr)
dat %>%
group_by(c1) %>%
mutate(match = apply(`diag<-`(abs(outer(c2, c2, `-`)) < 1e7, FALSE), 1, \(x) toString(c3[x]))) %>%
ungroup()
# A tibble: 7 × 4
c1 c2 c3 match
<dbl> <dbl> <chr> <chr>
1 1 52426577 chr1.52426577_T ""
2 1 108023890 chr1.108023890_A ""
3 1 129776943 chr1.129776943_T ""
4 2 39943710 chr2.39943710_C "chr2.43668478_G, chr2.47207905_G, chr2.47959251_C"
5 2 43668478 chr2.43668478_G "chr2.39943710_C, chr2.47207905_G, chr2.47959251_C"
6 2 47207905 chr2.47207905_G "chr2.39943710_C, chr2.43668478_G, chr2.47959251_C"
7 2 47959251 chr2.47959251_C "chr2.39943710_C, chr2.43668478_G, chr2.47207905_G"
Or sticking with base R:
transform(dat, match = unlist(lapply(split(dat, ~ c1), \(g) with(
g, apply(`diag<-`(abs(outer(
c2, c2, `-`
)) < 1e7, FALSE), 1, \(x) toString(c3[x]))
))))
c1 c2 c3 match
11 1 52426577 chr1.52426577_T
12 1 108023890 chr1.108023890_A
13 1 129776943 chr1.129776943_T
21 2 39943710 chr2.39943710_C chr2.43668478_G, chr2.47207905_G, chr2.47959251_C
22 2 43668478 chr2.43668478_G chr2.39943710_C, chr2.47207905_G, chr2.47959251_C
23 2 47207905 chr2.47207905_G chr2.39943710_C, chr2.43668478_G, chr2.47959251_C
24 2 47959251 chr2.47959251_C chr2.39943710_C, chr2.43668478_G, chr2.47207905_G