R - Extracting a number associated with a character


I am trying to extract the number of carbons, hydrogens, and oxygens from a chemical formula. I previously found some code that I have been trying to use. The problem is that the code only works when the element is followed by an explicit count, i.e. when the formula has more than 1 of that element.

V <- DATA # example: CH4O, H2O, C10H18O2
# V is a data.frame

C1 <- as.integer(sub("(?i).*?C:?\\s*(\\d+).*", "\\1", V))
# NA NA 10
H1 <- as.integer(sub("(?i).*?H:?\\s*(\\d+).*", "\\1", V))
# 4 2 18
O1 <- as.integer(sub("(?i).*?O:?\\s*(\\d+).*", "\\1", V))
# NA NA 2

I am currently using

C1[is.na(C1)] <- 1

to change the NAs to 1s, and then manually changing the values that should be 0. Is there more efficient code I can use to get the proper counts of the elements in the chemical formulas (specifically in the cases where the value is 0 or 1 and causes an NA result)? Let me know if you need more information or if I should change some of the format.

EDIT: The desired output is the correct count for every element, without NAs and, if possible, without manually changing values to 0.

C1
# 1 0 10
H1
# 4 2 18
O1
# 1 1 2

EDIT2: Here is an example of the actual data I am importing

Meas. m/z   #  Ion Formula  Score  m/z         err [mDa]  err [ppm]  mSigma  rdb  e¯ Conf  Adduct
84.080700   1  C5H10N       n.a.   84.080776   0.1        0.9        n.a.    2.0  even
89.060100   1  C4H9O2       n.a.   89.059706   -0.4       -4.4       n.a.    1.0  even
131.987800  1  C2H4N3P2     n.a.   131.987498  -0.3       -2.3       n.a.    6.0  even
135.081100  1  C9H11O       n.a.   135.080441  -0.7       -4.9       n.a.    5.0  even
135.117500  1  C10H15       n.a.   135.116827  -0.7       -5.0       n.a.    4.0  even
136.061700  1  C5H6N5       n.a.   136.061772  0.1        0.5        n.a.    6.0  even

In the initial question I just listed V as coming from a vector of formulas, but what I actually have is a data.frame with other information, and I use V[,3] when performing the calculations to get the column of interest.


Solution

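  • Here's an alternative that splits each formula into element tokens, treats a missing number as a count of 1, and fills in 0 for elements that do not appear: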

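    vec <- c("CH4O", "H2O", "C10H18O2", "C2H4N3P2")
    
    # Split each formula into element tokens such as "C10", "H", "O2"
    molecules <- regmatches(vec, gregexpr("[A-Z][a-z]*\\d*", vec))
    # Append an explicit count of 1 to tokens that have no number
    molecules <- lapply(molecules, function(a) paste0(a, ifelse(grepl("[^0-9]$", a), "1", "")))
    
    # Named integer vector per formula: names are elements, values are counts
    atomcounts <- lapply(molecules, function(mol) setNames(as.integer(gsub("\\D", "", mol)), gsub("\\d", "", mol)))
    
    # One column per element seen in any formula; 0 where the element is absent
    atoms <- unique(unlist(lapply(atomcounts, names)))
    atoms <- sapply(atoms, function(atom) sapply(atomcounts, function(a) if (atom %in% names(a)) a[[atom]] else 0L))
    rownames(atoms) <- vec
    atoms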
    #           C  H O N P
    # CH4O      1  4 1 0 0
    # H2O       0  2 1 0 0
    # C10H18O2 10 18 2 0 0
    # C2H4N3P2  2  4 0 3 2
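
If only a few elements are needed as separate vectors, as with C1, H1 and O1 in the question, a small helper can return them directly. This is a minimal sketch, not part of the original answer: the name count_element is hypothetical, and it assumes formulas use standard case-sensitive element symbols (so "C" must not match the "C" in "Cl") and contain no brackets or charges.

```r
# Hypothetical helper: count the atoms of one element in each formula.
count_element <- function(formulas, element) {
  # Element symbol not followed by a lowercase letter (so "C" skips "Cl"),
  # followed by an optional count. The lookahead needs perl = TRUE.
  pattern <- paste0(element, "(?![a-z])(\\d*)")
  vapply(as.character(formulas), function(f) {
    hits <- regmatches(f, gregexpr(pattern, f, perl = TRUE))[[1]]
    if (length(hits) == 0) return(0L)      # element absent: count is 0
    n <- sub(element, "", hits, fixed = TRUE)
    n[n == ""] <- "1"                      # no number written: count is 1
    sum(as.integer(n))                     # sum in case the element repeats
  }, integer(1), USE.NAMES = FALSE)
}

formulas <- c("CH4O", "H2O", "C10H18O2")
C1 <- count_element(formulas, "C")  # 1 0 10
H1 <- count_element(formulas, "H")  # 4 2 18
O1 <- count_element(formulas, "O")  # 1 1 2
```

With the data.frame from the question, the formula column would be passed instead of the vector, for example count_element(V[, 3], "C"). The as.character() call is there in case that column was read in as a factor.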