Search code examples
c++regexpcreboost-regex

Parsing a zero-width regex with a regex


We use zero-width regex strings to specify the places in a string of amino acid symbols (basically A-Z) that are valid cleavage sites. For example, the proteolytic enzyme trypsin cleaves after K or R except when followed by P ((?<=[KR])(?!P)). I want to convert these regexes to the "cut/no-cut" notation also common in this field. For example, trypsin cuts after "KR" with a no-cut of "P". My first attempt at this works for simple cases:

// match zero or one regex term like (?<=[KR]) or (?<=K) or (?<![KR]) or (?<!K)
// followed by zero or one term like (?=[KR]) or (?=K) or (?![KR]) or (?!K)
boost::regex cutNoCutRegex("(?:\\(+\\?<([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?(?:\\(+\\?([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?");

Without the C++ escaping, that's:

(?:\(+\?<([=!])(\[[A-Z]+\]|[A-Z])\)+)?(?:\(+\?([=!])(\[[A-Z]+\]|[A-Z])\)+)?

I'd like to change this to support somewhat more complicated regexes, such as multiple characters, non-capturing groups, character sets, ranges in character sets, negated sets, and start/end of string: (?<=K|R) or (?<=(?:K)|(?:R)) or (?<=[^A-JL-QS-Z]) or (?<=^M|[KR])

These extra features would seem to explode the complexity of the regex. I'm pretty sure I'll need to enable the "experimental" BOOST_REGEX_MATCH_EXTRA feature of Boost.Regex. Is there a better way to do what I'm doing? Am I missing some other regex possibilities in zero-width regexes?

Here is pseudo-code for my unit tests for the existing code covering many of the simple cases. The "sense" member is "C" when the "cut" field corresponds to the look-behind, and "N" when the "cut" field corresponds to the look-ahead. The current pepXMLSpecificity() function can invert the character set if it would produce a shorter list.

struct PepXMLSpecificity { std::string cut, no_cut, sense; };
void unit_assert_equal(string expected, string actual);

"(?<=[QWERTY])(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.no_cut);

"(?<![QWERTY])(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("QWERTY", result.no_cut);

"(?<=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);

"(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);

"(?<![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);

"(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);

// the following tests aren't supported yet

"(?<=^M)|(?<=[KR])"
unit_assert_equal("N", result.sense);
unit_assert_equal("KR", result.cut); // the 'M' part is dropped
unit_assert_equal("", result.no_cut);

"(?<=K|R)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);

"(?<=(?:K)|(?:R))"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);

"(?<=[^A-JL-QS-Z])(?!P)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("P", result.no_cut);

Solution

  • As I understand it the situation is, you have a library of existing regexes which, when applied to common string representations of aminio-acid sequences, identify probable cut-points for a proteolytic enzyme.

    You want to automatically produce a standard textual description of the cutpoints implied by the regex.

    Observations:

    • You don't need to be able to parse arbitrary regexes, you only need to be able to parse the cases that you actually have in your library.

    • You don't necessarily need to parse all of them. Particularly difficult ones could be kicked out and done by hand provided there weren't too many of them.

    Really I think you need to do the following.

    1. pepXMLSpecificity needs to return one or more descriptions, i.e. a vector<struct PepXMLSpecificity>, since a regex can be authored to combine arbitrary regexes cf. jpalacek's comment.

    2. You should tackle the actual contents of your library of regexes starting with the common cases and working down, and just add special cases for each common type of regex until you have got them all (or at least enough of them to satisfy your boss).