How to get the absolute value while noting if the number used to be negative?


I have a genetic dataset where each row describes a gene. The Beta column holds several beta values that I've collapsed into one cell (at the variant level, multiple variants in one gene gave multiple betas). Beta is the effect size the gene can have on a condition, so large negative values matter as much as large positive ones. I am trying to write code that picks the largest absolute value in each row and then creates a new column recording whether that value was originally negative. I have a biology background, so I'm not sure whether this is possible or the best way to do it.

For example my data looks like this:

Gene    Beta
ACE     0.01, -0.6, 0.4
BRCA    0.7, -0.2, 0.2 
ZAP70   NA
P53     0.8, -0.6, 0.001

Expected output would be something like this (selecting the largest absolute value and keeping track of which numbers used to be negative):

Gene    Beta     Negatives
ACE      0.6         1
BRCA     0.7         0
ZAP70    NA          NA
P53      0.8         0

I am currently stuck on getting the absolute value from each row. This is what I am trying:

abs2 = function(x) if(all(is.na(x))) NA else abs(x,na.rm = T)
getabs = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)abs2(as.numeric(x)) ) %>%
  unlist() 

test <- df %>%
  mutate_at(names(df)[2],getabs)

#Outputs:
 Error in abs(x, na.rm = T) : 2 arguments passed to 'abs' which requires 1 

Any help on getting the absolute value per cell/row would be appreciated. I assume I could then make a column holding the most negative value, match it against the absolute value, and use that as my record of negatives.

Input data:

dput(df)
structure(list(Gene = c("ACE", "BRCA", "ZAP70", "P53"), `Beta` = c("0.01, -0.6, 0.4", 
"0.7, -0.2, 0.2", "0.001, 0.02, -0.003", "0.8, -0.6, 0.001")), row.names = c(NA, 
-4L), class = c("data.table", "data.frame"))

Solution

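  • One way using dplyr and tidyr is to split the comma-separated values into separate rows, group_by Gene, take the maximum absolute value of Beta, and check whether that value was negative. Note that a gene whose Beta is NA (ZAP70 below) ends up as -Inf rather than NA, because max(..., na.rm = TRUE) has nothing left to compare.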

    library(dplyr)
    
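    df %>%
      # one row per individual beta value; convert = TRUE turns them into numbers
      tidyr::separate_rows(Beta, sep = ",", convert = TRUE) %>%
      group_by(Gene) %>%
      # Negatives is computed first, while Beta still refers to the raw values:
      # it is 1 when the most negative value is also the largest in magnitude
      summarise(Negatives = +(min(Beta) == -max(abs(Beta))),
                Beta = max(abs(Beta), na.rm = TRUE))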
    
    # A tibble: 4 x 3
    #  Gene  Negatives   Beta
    #  <fct>     <int>  <dbl>
    #1 ACE           1    0.6
    #2 BRCA          0    0.7
    #3 P53           0    0.8
    #4 ZAP70        NA   -Inf  
    

    data

    df <- structure(list(Gene = structure(c(1L, 2L, 4L, 3L), .Label = c("ACE", 
    "BRCA", "P53", "ZAP70"), class = "factor"), Beta = structure(c(1L, 
    2L, NA, 3L), .Label = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", 
    "0.8, -0.6, 0.001"), class = "factor")), class = "data.frame", 
    row.names = c(NA, -4L))
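
If NA is preferred over -Inf for genes with no beta values, the same idea can be sketched in base R. This is only an illustration under a few assumptions: the helper name `pick_largest` is hypothetical (it is not from the answer above), the data frame is rebuilt with plain character columns, and ties between a positive and a negative value of equal magnitude go to whichever comes first in the cell.

```r
# Same example data as above, as plain character columns (assumption:
# factors would work too, since as.character() is applied below).
df <- data.frame(
  Gene = c("ACE", "BRCA", "ZAP70", "P53"),
  Beta = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", NA, "0.8, -0.6, 0.001"),
  stringsAsFactors = FALSE
)

# Split each cell on commas and convert the pieces to numbers.
# as.numeric() tolerates the leading space left after each comma.
parsed <- lapply(strsplit(as.character(df$Beta), ","), as.numeric)

# Hypothetical helper: return the signed value with the largest magnitude,
# or NA when the cell holds no usable numbers.
pick_largest <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[which.max(abs(x))]
}

top <- vapply(parsed, pick_largest, numeric(1))

result <- data.frame(
  Gene      = df$Gene,
  Beta      = abs(top),             # magnitude of the strongest effect
  Negatives = as.integer(top < 0)   # 1 if that effect was negative, NA if missing
)

print(result)
```

Keeping the signed value in `top` until the last step means the sign is read from the same number whose magnitude is reported, so ZAP70 comes out as NA in both columns instead of -Inf.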