I'm trying to resolve ambiguity in a lowercase chemical formula. Since some element symbols are substrings of other element symbols, and they're all run together, there can be multiple global matches for the same pattern.
Consider the regex /^((h)|(s)|(hg)|(ga)|(as))+$/ against the string hgas. There are two possible matches: "hg, as" and "h, s, ga" (out of order compared to input, but not an issue). Obviously a regex for all possible symbols would be longer, but this example was done for simplicity.
The regex engine's backtracking allows it to conclusively determine whether even a very long string matches this pattern or whether there is no possible way to split it into symbols. It will diligently try every combination of alternatives and, for example, if it hits the end of the string with a leftover g, go back and retry a different combination.
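The retry behaviour described above can be seen in a minimal sketch using Python's re module. Python is an assumption made for illustration, since the question does not name a host language, and is_formula is a hypothetical helper name.

```python
import re

# Simplified pattern from the question: one or more symbols, anchored at both ends.
PATTERN = re.compile(r"^((h)|(s)|(hg)|(ga)|(as))+$")

def is_formula(text):
    """Return True if text can be split into symbols in at least one way."""
    return PATTERN.match(text) is not None

# "hg": the engine first takes h, is left with a stray g, backs up and takes hg.
print(is_formula("hg"))   # True
# "hgg": every combination leaves a stray g, so the match fails.
print(is_formula("hgg"))  # False
```

This only answers yes or no; it does not say how many ways the string can be split.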
I'm looking for a regular expression, or a language with some kind of extension, that adds the ability to keep looking for matches after one is found; in this case, finding "h, s, ga" as well as "hg, as".
Rebuilding the regex engine's backtracking by hand for this problem does not seem like a reasonable solution, especially considering the final regex also includes a \d* after each symbol.
I thought about reversing the order of the alternatives, /^((as)|(ga)|(hg)|(s)|(h))+$/, to find additional mappings, but this will find at most one additional match, and I don't have the theoretical background in regex to know if it's even reasonable to try.
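To illustrate how the order of the alternatives changes which single match is reported, here is a small sketch in Python (an assumption, as above; symbols_used is a hypothetical helper). It reads the symbols back from the capture groups, which is why they come out in group order rather than input order, and why a symbol used twice would be reported only once.

```python
import re

ORIGINAL = re.compile(r"^((h)|(s)|(hg)|(ga)|(as))+$")
REVERSED = re.compile(r"^((as)|(ga)|(hg)|(s)|(h))+$")

def symbols_used(pattern, text):
    """Return the symbol groups that took part in the one match the engine found."""
    match = pattern.match(text)
    if match is None:
        return []
    # Group 1 is the outer repeated group; the remaining groups are the symbols.
    return [g for g in match.groups()[1:] if g is not None]

print(symbols_used(ORIGINAL, "hgas"))  # ['h', 's', 'ga']
print(symbols_used(REVERSED, "hgas"))  # ['as', 'hg']
```

Each ordering still yields exactly one decomposition, so reordering can expose a different match but cannot enumerate all of them.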
I've created a sample page using my existing regex which finds 1 or 0 matches for a given lowercase string and returns it properly capitalized (and out of order). It uses the first 100 chemical symbols in its matching.
http://www.ptable.com/Script/lowercase_formula.php?formula=hgas
tl;dr: I have a regex to match 0 or 1 possible chemical formula permutations in a string. How do I find more than 1 match?
I'm well aware that this answer might be off-topic (in its approach), but I think it is quite interesting, and it solves the OP's problem.
If you don't mind learning a new language (Prolog), it can generate all possible combinations for you:
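% The known element symbols (shortened list for this example).
name(X) :- member(X, ['h', 's', 'hg', 'ga', 'as']).

% parse_(Chars, Symbols): split Chars into known symbols.
% On backtracking, every possible split is produced.
parse_([], []).
parse_(InList, [HeadAtom | OutTail]) :-
    atom_chars(InAtom, InList),
    name(HeadAtom),
    atom_concat(HeadAtom, TailAtom, InAtom),
    atom_chars(TailAtom, TailList),
    parse_(TailList, OutTail).

parse(In, Out) :- atom_chars(In, List), parse_(List, Out).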
Sample run:
?- parse('hgas', Out).
Out = [h, ga, s] ;
Out = [hg, as] ;
false.
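For readers who would rather not pick up Prolog, the same exhaustive search can be sketched as a recursive generator in Python. This is a swapped-in equivalent, not a translation the answer itself provides; the name segmentations is hypothetical.

```python
def segmentations(text, symbols):
    """Yield every way to split text into a sequence of known symbols."""
    if not text:
        # Nothing left to consume: one valid (empty) continuation.
        yield []
        return
    for symbol in symbols:
        if text.startswith(symbol):
            # Commit to this symbol, then enumerate all splits of the remainder.
            for rest in segmentations(text[len(symbol):], symbols):
                yield [symbol] + rest

SYMBOLS = ["h", "s", "hg", "ga", "as"]
print(list(segmentations("hgas", SYMBOLS)))
# [['h', 'ga', 's'], ['hg', 'as']]
```

A generator keeps the results lazy, much like asking Prolog for the next solution with ";".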
The improved version, which also handles numbers, is a bit longer:
isName(X) :- member(X, ['h', 's', 'hg', 'ga', 'as', 'o', 'c']).

% collect(Chars, Digits, Rest): split off the leading run of digits,
% since digits are never part of an element symbol.
collect([], [], []).
collect([H|T], [], [H|T]) :-
    \+ char_type(H, digit), !.
collect([H|T], [H|OT], L) :-
    char_type(H, digit), !, collect(T, OT, L).

parse_([], []).
% Case 1: the input starts with a known symbol.
parse_(InputChars, [Token | RestTokens]) :-
    atom_chars(InputAtom, InputChars),
    isName(Token),
    atom_concat(Token, TailAtom, InputAtom),
    atom_chars(TailAtom, TailChars),
    parse_(TailChars, RestTokens).
% Case 2: the input starts with a digit; consume the whole number as one token.
parse_(InputChars, [Token | RestTokens]) :-
    InputChars = [H|_], char_type(H, digit),
    collect(InputChars, NumberChars, TailChars),
    atom_chars(Token, NumberChars),
    parse_(TailChars, RestTokens).

parse(In, Out) :- atom_chars(In, List), parse_(List, Out).
Sample run:
?- parse('hgassc20h245o', X).
X = [h, ga, s, s, c, '20', h, '245', o] ;
X = [hg, as, s, c, '20', h, '245', o] ;
false.
?- parse('h2so4', X).
X = [h, '2', s, o, '4'] ;
false.
?- parse('hgas', X).
X = [h, ga, s] ;
X = [hg, as] ;
false.
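The number-aware version can be sketched in Python along the same lines, again as a hypothetical equivalent rather than part of the original answer. It assumes numbers are runs of ASCII digits and treats each whole run as a single token, as the Prolog collect predicate does.

```python
import re

SYMBOLS = ["h", "s", "hg", "ga", "as", "o", "c"]
DIGITS = re.compile(r"[0-9]+")  # one token per run of digits

def parse(text, symbols=SYMBOLS):
    """Yield every split of text into known symbols and whole numbers."""
    if not text:
        yield []
        return
    # Case 1: the text starts with a known symbol.
    for symbol in symbols:
        if text.startswith(symbol):
            for rest in parse(text[len(symbol):], symbols):
                yield [symbol] + rest
    # Case 2: the text starts with digits; consume the entire run at once.
    number = DIGITS.match(text)
    if number:
        for rest in parse(text[number.end():], symbols):
            yield [number.group()] + rest

for result in parse("hgassc20h245o"):
    print(result)
# ['h', 'ga', 's', 's', 'c', '20', 'h', '245', 'o']
# ['hg', 'as', 's', 'c', '20', 'h', '245', 'o']
```

Extending SYMBOLS to the full periodic table requires no other change to the function.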