I have a genetic dataset where each row describes a gene and has a beta column holding multiple beta values that I've compressed into one cell (at the variant level, multiple variants in one gene gave multiple betas). The beta is the effect size that the gene can have on a condition, so large negative values are as important as large positive values. I am trying to write code that selects the largest absolute value from each row and then creates a new column recording whether that value was originally negative. I have a biology background, so I'm not sure if this is possible or the best way to do this.
For example my data looks like this:
Gene Beta
ACE 0.01, -0.6, 0.4
BRCA 0.7, -0.2, 0.2
ZAP70 NA
P53 0.8, -0.6, 0.001
Expected output would be something like this (selecting the largest absolute value and keeping track of which numbers used to be negative):
Gene Beta Negatives
ACE 0.6 1
BRCA 0.7 0
ZAP70 NA NA
P53 0.8 0
I am currently stuck on getting the absolute value from each row. What I am trying is this:
abs2 = function(x) if(all(is.na(x))) NA else abs(x,na.rm = T)
getabs = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
lapply(.,function(x)abs2(as.numeric(x)) ) %>%
unlist()
test <- df %>%
mutate_at(names(df)[2],getabs)
#Outputs:
Error in abs(x, na.rm = T) : 2 arguments passed to 'abs' which requires 1
Any help on how to just get the absolute value per cell/row would be appreciated, as I assume I could also make a column getting the largest negative value, match that to identical absolute values and use that as my negatives record.
Input data:
dput(df)
structure(list(Gene = c("ACE", "BRCA", "ZAP70", "P53"), `Beta` = c("0.01, -0.6, 0.4",
"0.7, -0.2, 0.2", "0.001, 0.02, -0.003", "0.8, -0.6, 0.001")), row.names = c(NA,
-4L), class = c("data.table", "data.frame"))
One way using dplyr is to split the comma-separated values into separate rows, group_by Gene, take the max absolute value of Beta, and check whether that value is negative.
library(dplyr)
df %>%
tidyr::separate_rows(Beta, sep = ",", convert = TRUE) %>%
group_by(Gene) %>%
summarise(Negatives = +(min(Beta) == -max(abs(Beta))),
Beta = max(abs(Beta), na.rm = TRUE))
# A tibble: 4 x 3
# Gene Negatives Beta
# <fct> <int> <dbl>
#1 ACE 1 0.6
#2 BRCA 0 0.7
#3 P53 0 0.8
#4 ZAP70 NA -Inf
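In the output above, ZAP70 gets -Inf because max(..., na.rm = TRUE) has nothing left to take the maximum of once the NA is dropped. A minimal sketch of one way to return NA instead, assuming the input uses plain character columns (the df below is rebuilt for illustration, and the all(is.na(...)) guard is an addition, not part of the answer above):

```r
library(dplyr)
library(tidyr)

# Illustrative input matching the question's example; ZAP70 has no betas.
df <- data.frame(
  Gene = c("ACE", "BRCA", "ZAP70", "P53"),
  Beta = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", NA, "0.8, -0.6, 0.001"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # One row per beta value; convert = TRUE turns the pieces into numbers.
  separate_rows(Beta, sep = ",\\s*", convert = TRUE) %>%
  group_by(Gene) %>%
  summarise(
    # Sign of the value with the largest magnitude (NA if the gene has no betas).
    Negatives = if (all(is.na(Beta))) NA_integer_
                else as.integer(Beta[which.max(abs(Beta))] < 0),
    # Largest magnitude itself; guarded so an all-NA gene gives NA, not -Inf.
    Beta = if (all(is.na(Beta))) NA_real_ else max(abs(Beta), na.rm = TRUE)
  )

print(res)
```

Note that which.max() picks the first value when two betas tie in magnitude (for example 0.6 and -0.6), so the flag reflects whichever comes first in the cell.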
data
df <- structure(list(Gene = structure(c(1L, 2L, 4L, 3L), .Label = c("ACE",
"BRCA", "P53", "ZAP70"), class = "factor"), Beta = structure(c(1L,
2L, NA, 3L), .Label = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2",
"0.8, -0.6, 0.001"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L))
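The error in the question comes from abs() itself: it accepts a single argument and has no na.rm option. If you would rather stay close to the original per-cell idea, here is a rough base R sketch. The helper name pick_largest is hypothetical, and the df is rebuilt with character columns for illustration:

```r
# Hypothetical helper: return the signed value with the largest magnitude
# in one comma-separated cell, or NA if the cell holds no numbers.
pick_largest <- function(cell) {
  x <- suppressWarnings(as.numeric(strsplit(as.character(cell), ",")[[1]]))
  if (all(is.na(x))) return(NA_real_)
  x[which.max(abs(x))]  # which.max() skips NA values
}

# Illustrative input matching the question's example.
df <- data.frame(
  Gene = c("ACE", "BRCA", "ZAP70", "P53"),
  Beta = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", NA, "0.8, -0.6, 0.001"),
  stringsAsFactors = FALSE
)

# Keep the sign first, then derive both output columns from it.
signed <- vapply(df$Beta, pick_largest, numeric(1), USE.NAMES = FALSE)

out <- data.frame(
  Gene = df$Gene,
  Beta = abs(signed),
  Negatives = as.integer(signed < 0)
)

print(out)
```

Keeping the signed value until the last step means the Negatives column does not need a separate matching pass against the largest negative value.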