I am trying to extract the number of carbons, hydrogens, and oxygens from a chemical formula. I previously found some code that I have been trying to use. The problem is that the code only works when the formula contains more than one atom of the element, i.e. when the element symbol is followed by a number.
V <- DATA # example: CH4O, H2O, C10H18O2
# V is a data.frame
C1 <- as.integer(sub("(?i).*?C:?\\s*(\\d+).*", "\\1", V))
# NA NA 10
H1 <- as.integer(sub("(?i).*?H:?\\s*(\\d+).*", "\\1", V))
# 4 2 18
O1 <- as.integer(sub("(?i).*?O:?\\s*(\\d+).*", "\\1", V))
# NA NA 2
I am currently using
C1[is.na(C1)] <- 1
to change the NAs to 1s and then manually correcting the values that should be 0. Is there more efficient code I can use to get the proper element counts from the chemical formulas (specifically in the cases where the count is 0 or 1, which currently produce NA)? Let me know if you need more information or if I should change the format.
EDIT: The desired result is the correct count for every formula, with no NAs and, if possible, without having to change values to 0 manually.
C1
# 1 0 10
H1
# 4 2 18
O1
# 1 1 2
EDIT2: Here is an example of the actual data I am importing
Meas. m/z   #  Ion Formula  Score  m/z         err [mDa]  err [ppm]  mSigma  rdb  e¯ Conf  Adduct
84.080700   1  C5H10N       n.a.   84.080776   0.1        0.9        n.a.    2.0  even
89.060100   1  C4H9O2       n.a.   89.059706   -0.4       -4.4       n.a.    1.0  even
131.987800  1  C2H4N3P2     n.a.   131.987498  -0.3       -2.3       n.a.    6.0  even
135.081100  1  C9H11O       n.a.   135.080441  -0.7       -4.9       n.a.    5.0  even
135.117500  1  C10H15       n.a.   135.116827  -0.7       -5.0       n.a.    4.0  even
136.061700  1  C5H6N5       n.a.   136.061772  0.1        0.5        n.a.    6.0  even
In the initial question I just listed V as coming from a vector of formulas, but what I actually have is a data.frame with other information, and I use V[,3] to select the column of interest when performing the calculations.
Here's an alternative:
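vec <- c("CH4O", "H2O", "C10H18O2", "C2H4N3P2")
# split each formula into element tokens, e.g. "C10H18O2" -> "C10" "H18" "O2"
molecules <- regmatches(vec, gregexpr("\\b[A-Z][a-z]*\\d*", vec))
# append an explicit "1" to tokens that have no count, e.g. "C" -> "C1"
molecules <- lapply(molecules, function(a) paste0(a, ifelse(grepl("[^0-9]$", a), "1", "")))
# per formula: integer counts named by element symbol
atomcounts <- lapply(molecules, function(mol) setNames(as.integer(gsub("\\D", "", mol)), gsub("\\d", "", mol)))
# all elements seen in any formula (lapply, so the result is never simplified to a matrix)
atoms <- unique(unlist(lapply(atomcounts, names)))
# one column per element; 0 where a formula lacks that element
atoms <- sapply(atoms, function(atom) sapply(atomcounts, function(a) if (atom %in% names(a)) a[atom] else 0))
rownames(atoms) <- vec
atoms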
#           C  H O N P
# CH4O      1  4 1 0 0
# H2O       0  2 1 0 0
# C10H18O2 10 18 2 0 0
# C2H4N3P2  2  4 0 3 2
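If you only need a few elements from a data.frame column, a per-element helper may be enough. The original sub() calls return NA because (\\d+) requires at least one digit: when the pattern does not match, sub() returns its input unchanged, and as.integer() turns that string into NA. Below is a minimal sketch, not a definitive implementation. count_element() is a hypothetical helper name, and the column names meas_mz, n and ion_formula are my own placeholders for the first three columns of the sample in EDIT2. It assumes each element symbol appears at most once per formula (as in the examples above), since only the first occurrence is read.

```r
# Hypothetical helper (not from the original post): count of one element per formula.
# Assumes each element symbol occurs at most once in a formula.
count_element <- function(formulas, element) {
  # element symbol not followed by a lowercase letter (so "C" does not match "Cl"),
  # then an optional run of digits captured as group 1
  pattern <- paste0(element, "(?![a-z])(\\d*)")
  hits <- regmatches(formulas, regexec(pattern, formulas, perl = TRUE))
  vapply(hits, function(h) {
    if (length(h) == 0L) return(0L)  # element absent from the formula
    if (h[2] == "") return(1L)       # symbol present without a digit: implicit 1
    as.integer(h[2])
  }, integer(1))
}

# First three rows of the sample data; column names are placeholders
V <- data.frame(
  meas_mz     = c(84.080700, 89.060100, 131.987800),
  n           = 1L,
  ion_formula = c("C5H10N", "C4H9O2", "C2H4N3P2"),
  stringsAsFactors = FALSE
)

# as.character() also covers the case where the column was imported as a factor
formulas <- as.character(V[, 3])
counts <- data.frame(
  C = count_element(formulas, "C"),
  H = count_element(formulas, "H"),
  O = count_element(formulas, "O")
)
counts
```

Note that the column is passed in (V[, 3]), not the whole data.frame, so each formula is processed as its own string.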