I am trying to write an HTTP parser using flex and bison. My first thought was to lex only the most basic tokens and let the parser combine them. For example, for the HTTP header there would be tokens like `GET`, `POST` or `PUT`, as well as `URL` and `HTTP_VERSION`. The problem is that it is perfectly valid for a URL to contain e.g. the string "POST" (as in `GET /POST/form HTTP/1.1`), and the lexer cannot differentiate between a "POST" inside a `URL` token and the "POST" of an actual `POST` token.
A solution to that problem would be to use flex start condition states to enable and disable specific patterns depending on what is expected. After finding a `GET`, `POST`, etc. token in the `INITIAL` state, the lexer switches to another state where it only matches the `URL`. This solution does not feel right, though. It feels like implementing logic in the lexer that belongs in the parser (things like "there is exactly one `GET`, `POST`, ... token followed by exactly one `URL`").
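For illustration, the start-condition version I have in mind looks roughly like this (a sketch; `METHOD` and `URL` are token codes I assume would come from the bison-generated header, here called `parser.tab.h`):

```lex
%{
/* Sketch only: METHOD and URL are assumed to be token codes
 * declared in the bison-generated header. */
#include "parser.tab.h"
%}
%x EXPECT_URL

%%
<INITIAL>"GET"|"POST"|"PUT"   { BEGIN(EXPECT_URL); return METHOD; }
<EXPECT_URL>[^ \t\r\n]+       { BEGIN(INITIAL); return URL; }
[ \t\r\n]+                    { /* skip whitespace between tokens */ }
%%
```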
That brings me to my question:
Is there a way to "synchronize" between a bison parser and a flex lexer, so that I can change the lexer state depending on what bison expects to be the next token?
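Roughly, I imagine a shared flag that a parser action sets and the lexer reads before matching. A sketch of what I mean (`expect_url` is a made-up global):

```yacc
/* Parser side (sketch): a mid-rule action sets a flag before the
 * lexer is asked for the next token. expect_url is a made-up global. */
request
    : method { expect_url = 1; } url http_version
    ;

/* Lexer side, matching rule (flex syntax):
 *
 *   [^ \t\r\n]+   { if (expect_url) { expect_url = 0; return URL; }
 *                   return keyword_token(yytext); }
 */
```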
P.S.: I don't know much about grammar classes and the theoretical side of parsing languages. It feels to me that parsing an HTTP header does not even require a context-free parser like bison. Is HTTP a regular language, so that I can parse it with a regular expression?
HTTP is a regular language. We can see that from the fact that it can be parsed with a finite state machine. In theory it should be possible (don't quote me on that) to match an entire HTTP message with a regular expression. The problem with this parsing approach is that common regex implementations don't have very sophisticated capture facilities. Often we can capture single submatches and store them for later use, but there is nothing like the callback-style actions in bison, where one can attach arbitrary code to the recognition of some subpattern.
The reason that bison and flex are not the right tools for this problem (as stated in other answers) lies in a very fundamental property that distinguishes the HTTP language from the typical programming languages flex and bison are usually used for:
A programming language consists of an alphabet of tokens. That means e.g. a keyword can be thought of as a single element of that alphabet, and the input stream of characters can be unambiguously converted into a stream of tokens. This is exactly the purpose of the lexer, which acts as a translator from one alphabet to another. There are situations where the lexer additionally acts as a parser and needs more internal logic than simply translating input sequences to tokens. One of the most common of these situations is comments: a keyword inside a comment no longer emits a token.
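The comment case in flex terms (a sketch; `KW_WHILE` is a made-up token code): one rule swallows the whole comment and returns nothing, so a keyword inside it never becomes a token:

```lex
%%
"/*"([^*]|"*"+[^*/])*"*"+"/"   { /* C-style comment: consume, emit no token */ }
"while"                        { return KW_WHILE; /* keyword outside comments */ }
%%
```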
For HTTP, on the other hand, the alphabet is just the input characters, so it makes little sense to use a lexer. The communication problem between flex and bison that the OP mentioned stems from trying to split the parsing of the language (which is a single problem) into two parts. Flex is not a parser and is not suitable for parsing. The solution is simply to drop the lexer stage, as it is not needed: bison is perfectly capable of producing a parser for a regular language and can be used to parse HTTP.
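For instance, bison can work directly on the character alphabet by letting `yylex` hand over one raw character at a time (a sketch, not a complete HTTP grammar):

```yacc
%{
#include <stdio.h>
/* Scannerless sketch: yylex returns each input character as its own
 * token, so the grammar works directly over the character alphabet. */
int yylex(void) { int c = getchar(); return c == EOF ? 0 : c; }
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}

%%
request:  method ' ' url ' ' version ;
method:   'G' 'E' 'T' | 'P' 'O' 'S' 'T' ;
url:      urlchar | url urlchar ;
urlchar:  'A' /* ... one alternative per allowed URL character ... */ ;
version:  'H' 'T' 'T' 'P' '/' digit '.' digit ;
digit:    '0' | '1' /* ... */ ;
%%
```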
That brings me to the last point. Even though bison is capable of generating a parser for a regular language, it is not the most suitable tool. Bison generates parsers for context-free languages, which are parsed by a finite state machine with a stack. For parsing a regular language the stack is unnecessary, as there are no nested structures that need to be "remembered". Thus an HTTP parser generated by bison will most likely not be as lean as one produced by a dedicated regular-language parser generator. Sadly, I could not find a pure regular-language parser generator, but such a tool would be preferable for generating an HTTP parser.