I would like to use boost::spirit to extract the stoichiometry of compounds made of several elements from a brute formula. Within a given compound, my parser should be able to distinguish three kinds of chemical element patterns:
These patterns are then used to parse compounds such as the following:
Obviously, the chemical element patterns can appear in any order (e.g. CH[1]4 and H[1]4C ...) and with any frequency.
I wrote a parser that comes quite close to doing the job, but I still face one problem.
Here is my code:
template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>>
{
ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
{
namespace phx = boost::phoenix;
// Semantic action for handling the case of pure isotope
phx::function<PureIsotopeBuilder> const build_pure_isotope = PureIsotopeBuilder();
// Semantic action for handling the case of pure isotope mixture
phx::function<IsotopesMixtureBuilder> const build_isotopes_mixture = IsotopesMixtureBuilder();
// Semantic action for handling the case of natural element
phx::function<NaturalElementBuilder> const build_natural_element = NaturalElementBuilder();
phx::function<UpdateElement> const update_element = UpdateElement();
// XML database that stores all the isotopes of the periodic table
ChemicalDatabaseManager<Isotope>* imgr=ChemicalDatabaseManager<Isotope>::Instance();
const auto& isotopeDatabase=imgr->getDatabase();
// Loop over the database to populate the Spirit symbol tables for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
for (const auto& isotope : isotopeDatabase) {
_isotopeNames.add(isotope.second.getName(),isotope.second.getName());
_elementSymbols.add(isotope.second.getProperty<std::string>("symbol"),isotope.second.getProperty<std::string>("symbol"));
}
_mixtureToken = "{" >> +(_isotopeNames >> "(" >> qi::double_ >> ")") >> "}";
_isotopesMixtureToken = (_elementSymbols[qi::_a=qi::_1] >> _mixtureToken[qi::_b=qi::_1])[qi::_pass=build_isotopes_mixture(qi::_val,qi::_a,qi::_b)];
_pureIsotopeToken = (_isotopeNames[qi::_a=qi::_1])[qi::_pass=build_pure_isotope(qi::_val,qi::_a)];
_naturalElementToken = (_elementSymbols[qi::_a=qi::_1])[qi::_pass=build_natural_element(qi::_val,qi::_a)];
_start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[qi::_a=qi::_1] >>
(qi::double_|qi::attr(1.0))[qi::_b=qi::_1])[qi::_pass=update_element(qi::_val,qi::_a,qi::_b)] );
}
//! Defines the rule for matching a prefix
qi::symbols<char,std::string> _isotopeNames;
qi::symbols<char,std::string> _elementSymbols;
qi::rule<Iterator,isotopesMixture()> _mixtureToken;
qi::rule<Iterator,isotopesMixture(),qi::locals<std::string,isotopesMixture>> _isotopesMixtureToken;
qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _pureIsotopeToken;
qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _naturalElementToken;
qi::rule<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>> _start;
};
Basically, each separate element pattern can be parsed properly with its respective semantic action, which produces as output a map between the isotopes that build the compound and their corresponding stoichiometry. The problem starts when parsing the following compound:
CH{H[1](0.9)H[2](0.4)}
In this case the semantic action build_isotopes_mixture returns false because 0.9+0.4 makes no sense as a sum of ratios. Hence I would have expected, and wanted, my parser to fail for this compound. However, because the _start rule uses the alternative operator for the three kinds of chemical element pattern, the parser manages to parse it by 1) throwing away the {H[1](0.9)H[2](0.4)} part, 2) keeping the preceding H, and 3) parsing it using the _naturalElementToken. Is my grammar not clear enough to be expressed as a parser? How can I use the alternative operator in such a way that, when an occurrence has been found but the semantic action returned false, the parser stops?
How to use the alternative operator in such a way that, when an occurrence has been found but gave a false when running the semantic action, the parser stops ?
In general, you achieve this by adding an expectation point to prevent backtracking.
In this case you are actually "conflating" several tasks:
Spirit excels at matching input and has great facilities for interpreting it (mostly in the sense of AST creation). However, things get "nasty" when validating on the fly.
A piece of advice I often repeat is to separate the concerns whenever possible. I'd consider
This gives you the most expressive code while keeping it highly maintainable.
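One possible reading of "separating the concerns" is to let the parser build a plain data structure and run the validation as an ordinary function afterwards. The sketch below uses only the standard library; the `MixtureNode` type and the `weights_sum_to_one` function are hypothetical names chosen for illustration, and the tolerance mirrors the one used later in this answer.

```cpp
#include <cmath>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST node: an element symbol plus (isotope name, ratio) pairs,
// as a parser without any validating semantic actions might produce it.
struct MixtureNode {
    std::string element;
    std::vector<std::pair<std::string, double>> components;
};

// Validation step, run after parsing has finished: the ratios must total 1.
bool weights_sum_to_one(MixtureNode const& node, double tolerance = 1e-5) {
    double total = std::accumulate(
        node.components.begin(), node.components.end(), 0.0,
        [](double acc, std::pair<std::string, double> const& c) {
            return acc + c.second;
        });
    return std::abs(1.0 - total) < tolerance;
}
```

With this split, the grammar never has to reject syntactically valid input, so backtracking cannot hide a semantic error; the caller decides how to report an invalid mixture.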
Because I don't understand the problem domain well enough, and the code sample is not nearly complete enough to infer it, I will not try to give a full sample of what I have in mind. Instead I'll do my best to sketch the expectation point approach I mentioned at the outset.
This took the most time. (Consider doing the leg work for the people who are going to help you)
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>
namespace qi = boost::spirit::qi;
struct DummyBuilder {
using result_type = bool;
template <typename... Ts>
bool operator()(Ts&&...) const { return true; }
};
struct PureIsotopeBuilder : DummyBuilder { };
struct IsotopesMixtureBuilder : DummyBuilder { };
struct NaturalElementBuilder : DummyBuilder { };
struct UpdateElement : DummyBuilder { };
struct Isotope {
std::string getName() const { return _name; }
Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }
template <typename T> std::string getProperty(std::string const& name) const {
if (name == "symbol")
return _symbol;
throw std::domain_error("no such property (" + name + ")");
}
private:
std::string _name, _symbol;
};
using MixComponent = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;
template <typename Isotope>
struct ChemicalDatabaseManager {
static ChemicalDatabaseManager* Instance() {
static ChemicalDatabaseManager s_instance;
return &s_instance;
}
auto& getDatabase() { return _db; }
private:
std::map<int, Isotope> _db {
{ 1, { "H[1]", "H" } },
{ 2, { "H[2]", "H" } },
{ 3, { "Carbon", "C" } },
{ 4, { "U[235]", "U" } },
};
};
template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> >
{
ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
{
using namespace qi;
namespace phx = boost::phoenix;
phx::function<PureIsotopeBuilder> build_pure_isotope; // Semantic action for handling the case of pure isotope
phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of pure isotope mixture
phx::function<NaturalElementBuilder> build_natural_element; // Semantic action for handling the case of natural element
phx::function<UpdateElement> update_element;
// XML database that stores all the isotopes of the periodic table
ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
const auto& isotopeDatabase=imgr->getDatabase();
// Loop over the database to populate the Spirit symbol tables for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
for (const auto& isotope : isotopeDatabase) {
_isotopeNames.add(isotope.second.getName(),isotope.second.getName());
_elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"),isotope.second.template getProperty<std::string>("symbol"));
}
_mixtureToken = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
_isotopesMixtureToken = (_elementSymbols[_a=_1] >> _mixtureToken[_b=_1])[_pass=build_isotopes_mixture(_val,_a,_b)];
_pureIsotopeToken = (_isotopeNames[_a=_1])[_pass=build_pure_isotope(_val,_a)];
_naturalElementToken = (_elementSymbols[_a=_1])[_pass=build_natural_element(_val,_a)];
_start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[_a=_1] >>
(double_|attr(1.0))[_b=_1]) [_pass=update_element(_val,_a,_b)] );
}
private:
//! Defines the rule for matching a prefix
qi::symbols<char, std::string> _isotopeNames;
qi::symbols<char, std::string> _elementSymbols;
qi::rule<Iterator, isotopesMixture()> _mixtureToken;
qi::rule<Iterator, isotopesMixture(), qi::locals<std::string, isotopesMixture> > _isotopesMixtureToken;
qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _pureIsotopeToken;
qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _naturalElementToken;
qi::rule<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> > _start;
};
int main() {
using It = std::string::const_iterator;
ChemicalFormulaParser<It> parser;
for (std::string const input : {
"C", // --> natural carbon made of C[12] and C[13] in natural abundance
"CH4", // --> methane made of natural carbon and hydrogen
"C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
"C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
"U[235]", // --> pure uranium 235
})
{
std::cout << " ============= '" << input << "' ===========\n";
It f = input.begin(), l = input.end();
isotopesMixture mixture;
bool ok = qi::parse(f, l, parser, mixture);
if (ok)
std::cout << "Parsed successfully\n";
else
std::cout << "Parse failure\n";
if (f != l)
std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
}
}
Which, as given, just prints
============= 'C' ===========
Parsed successfully
============= 'CH4' ===========
Parsed successfully
============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Parsed successfully
============= 'U[235]' ===========
Parsed successfully
There is no need for the locals; just use the regular placeholders:
_mixtureToken = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
_isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];
_pureIsotopeToken = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
_naturalElementToken = _elementSymbols [ _pass=build_natural_element(_val, _1) ];
_start = +(
( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
(double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ]
);
// ....
qi::rule<Iterator, isotopesMixture()> _mixtureToken;
qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
qi::rule<Iterator, isotopesMixture()> _start;
You will want to handle conflicts between names/symbols (possibly just by prioritizing one or the other).
Conforming compilers will require the template qualifier (unless I totally mis-guessed your data structure, in which case I don't know what the template argument to ChemicalDatabaseManager was supposed to mean).
Hint: MSVC is not a standards-conforming compiler.
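As a standalone illustration of the `template` qualifier note, here is a minimal sketch of the general rule: when a member template is called on an object whose type depends on a template parameter, the call needs the `template` keyword. The `Record` type and `symbol_of` function are hypothetical stand-ins, not the question's real classes.

```cpp
#include <string>

// Hypothetical stand-in for a type with a templated property getter.
struct Record {
    template <typename T>
    T getProperty(std::string const& name) const {
        return T(name); // toy behaviour: echo the property name back
    }
};

template <typename R>
std::string symbol_of(R const& record) {
    // 'record' has a dependent type, so without 'template' the '<' would be
    // parsed as a less-than operator by a conforming compiler.
    return record.template getProperty<std::string>("symbol");
}
```

Writing `record.getProperty<std::string>("symbol")` inside `symbol_of` is what a conforming compiler rejects. Outside a dependent context the qualifier is optional.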
Assuming that the "weights" need to add up to 100% inside the _mixtureToken rule, we can either make build_isotopes_mixture "not dummy" and add the validation:
struct IsotopesMixtureBuilder {
bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
using namespace boost::adaptors;
// validate weights total only
return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
}
};
However, as you note, that will thwart things by backtracking. Instead you might assert that any complete mixture adds up to 100%:
_mixtureToken = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));
With something like
struct ValidateWeightTotal {
bool operator()(isotopesMixture const& mixture) const {
using namespace boost::adaptors;
bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
return ok;
// or perhaps throw a dedicated exception instead of returning false:
// return ok ? ok : throw InconsistentsWeights {};
}
struct InconsistentsWeights : virtual std::runtime_error {
InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
};
};
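To see why a validation that merely returns false gets "thwarted" by backtracking, while a hard failure does not, here is a library-free model of an ordered choice between two branches. Everything in it is a simplification made up for illustration: `match_mixture`, `match_natural`, `match_element`, `CommitFailure`, and the literal "ok" standing in for a valid weight total.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

struct CommitFailure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Branch 1: "H{...}" with a validated body. Returns characters consumed, 0 on failure.
std::size_t match_mixture(std::string const& in, bool commit) {
    if (in.size() < 2 || in[0] != 'H' || in[1] != '{') return 0;
    std::size_t close = in.find('}');
    if (close == std::string::npos) return 0;
    bool valid = in.substr(2, close - 2) == "ok"; // stand-in for the weight check
    if (!valid) {
        if (commit) throw CommitFailure("invalid mixture"); // hard failure
        return 0; // soft failure: the caller is free to try another branch
    }
    return close + 1;
}

// Branch 2: a bare "H".
std::size_t match_natural(std::string const& in) {
    return (!in.empty() && in[0] == 'H') ? 1 : 0;
}

// Ordered choice: try branch 1, then fall back to branch 2 from the same position.
std::size_t match_element(std::string const& in, bool commit) {
    if (std::size_t n = match_mixture(in, commit)) return n;
    return match_natural(in);
}
```

Without the commit flag, "H{bad}" still "succeeds" by consuming only the leading H, which is the behaviour described in the question. With the flag set, the invalid body raises an error and no fallback is attempted.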
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/numeric.hpp>
#include <cmath>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>
namespace qi = boost::spirit::qi;
struct DummyBuilder {
using result_type = bool;
template <typename... Ts>
bool operator()(Ts&&...) const { return true; }
};
struct PureIsotopeBuilder : DummyBuilder { };
struct NaturalElementBuilder : DummyBuilder { };
struct UpdateElement : DummyBuilder { };
struct Isotope {
std::string getName() const { return _name; }
Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }
template <typename T> std::string getProperty(std::string const& name) const {
if (name == "symbol")
return _symbol;
throw std::domain_error("no such property (" + name + ")");
}
private:
std::string _name, _symbol;
};
using MixComponent = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;
struct IsotopesMixtureBuilder {
bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
using namespace boost::adaptors;
// validate weights total only
return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
}
};
struct ValidateWeightTotal {
bool operator()(isotopesMixture const& mixture) const {
using namespace boost::adaptors;
bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
return ok;
// or perhaps throw a dedicated exception instead of returning false:
// return ok ? ok : throw InconsistentsWeights {};
}
struct InconsistentsWeights : virtual std::runtime_error {
InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
};
};
template <typename Isotope>
struct ChemicalDatabaseManager {
static ChemicalDatabaseManager* Instance() {
static ChemicalDatabaseManager s_instance;
return &s_instance;
}
auto& getDatabase() { return _db; }
private:
std::map<int, Isotope> _db {
{ 1, { "H[1]", "H" } },
{ 2, { "H[2]", "H" } },
{ 3, { "Carbon", "C" } },
{ 4, { "U[235]", "U" } },
};
};
template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture()>
{
ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
{
using namespace qi;
namespace phx = boost::phoenix;
phx::function<PureIsotopeBuilder> build_pure_isotope; // Semantic action for handling the case of pure isotope
phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of pure isotope mixture
phx::function<NaturalElementBuilder> build_natural_element; // Semantic action for handling the case of natural element
phx::function<UpdateElement> update_element;
phx::function<ValidateWeightTotal> validate_weight_total;
// XML database that stores all the isotopes of the periodic table
ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
const auto& isotopeDatabase=imgr->getDatabase();
// Loop over the database to populate the Spirit symbol tables for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
for (const auto& isotope : isotopeDatabase) {
_isotopeNames.add(isotope.second.getName(),isotope.second.getName());
_elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"), isotope.second.template getProperty<std::string>("symbol"));
}
_mixtureToken = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));
_isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];
_pureIsotopeToken = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
_naturalElementToken = _elementSymbols [ _pass=build_natural_element(_val, _1) ];
_start = +(
( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
(double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ]
);
}
private:
//! Defines the rule for matching a prefix
qi::symbols<char, std::string> _isotopeNames;
qi::symbols<char, std::string> _elementSymbols;
qi::rule<Iterator, isotopesMixture()> _mixtureToken;
qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
qi::rule<Iterator, isotopesMixture()> _start;
};
int main() {
using It = std::string::const_iterator;
ChemicalFormulaParser<It> parser;
for (std::string const input : {
"C", // --> natural carbon made of C[12] and C[13] in natural abundance
"CH4", // --> methane made of natural carbon and hydrogen
"C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
"C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
"U[235]", // --> pure uranium 235
}) try
{
std::cout << " ============= '" << input << "' ===========\n";
It f = input.begin(), l = input.end();
isotopesMixture mixture;
bool ok = qi::parse(f, l, parser, mixture);
if (ok)
std::cout << "Parsed successfully\n";
else
std::cout << "Parse failure\n";
if (f != l)
std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
} catch(std::exception const& e) {
std::cout << "Caught exception '" << e.what() << "'\n";
}
}
Prints
============= 'C' ===========
Parsed successfully
============= 'CH4' ===========
Parsed successfully
============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Caught exception 'boost::spirit::qi::expectation_failure'
============= 'U[235]' ===========
Parsed successfully