Search code examples
haskellghchappy

Pushing back tokens in Happy and Alex


I'm parsing a language which has both < and <<. In my Alex definition I've got something that contains something like

tokens :-

"<"             { token Lt }
"<<"            { token (BinOp Shl) }

so whenever I encounter <<, that gets tokenized as a left shift and not as to less-than's. This is generally a good thing, since I end up throwing out whitespace after tokenization and want to differentiate between 1 < < 2 and 1 << 2. However, there are other times I wish << had been read as two <. For example, I have things like

<<A>::B> 

which I want read like

< < A > :: B >

Obviously I can try to adjust my Happy parser rules to accommodate for the extra cases, but that scales badly. In other imperative parser generators, I might try to do something like push back "part" of the token (something like push_back("<") when I encountered << but I only needed <).

Has anyone else had such a problem and, if so, how did you deal with it? Are there ways of "pushing back" tokens in Happy? Should I instead try to keep a whitespace token around (I'm actually leaning towards the last alternative - although being a huge headache, it would let me deal with << by just making sure there is no whitespace between the two <).


Solution

  • While I initially went with @Jon's answer, I ended up running into a variety of precedence related issues (think precedence around expr < expr vs expr << expr) which caused me a lot of headaches. I recently (successfully) back to lexing << as one token. The solution was twofold:

    1. I bit the bullet and added extra rules for << (where previously I only had rules for <). For the example in the question (<<A>::B>) my rule went from something like

      ty_qual_path
        : '<' ty_sum '>' '::' ident
      

      to

      ty_qual_path
        : '<' ty_sum '>' '::' ident
        | '<<' ty_sum '>' '::' ident '>' '::' ident
      

      (The actual rule was actually a bit more involved, but that is not for this answer).

    2. I found a clever way to deal with token that started with > (these would cause problems around things like vector<i32,vector<i32>> where the last >> was a token): use a threaded lexer (section 2.5.2), exploit the {%% ... } RHS of rules which lets you reconsider the lookahead token, and add a pushToken facility to my parser monad (this turned out to be quite simple - here is exactly what I did). I then added a dummy rule - something like

      gt :: { () }
        : {- empty -}   {%% \tok ->
            case tok of
              Tok ">>"  -> pushToken (Tok ">")  *> pushToken (Tok ">")
              Tok ">="  -> pushToken (Tok "=")  *> pushToken (Tok ">")
              Tok ">>=" -> pushToken (Tok ">=") *> pushToken (Tok ">")
              _         -> pushToken tok
          }
      

      And every time in some other rule I expected a > but there could also be any other token starting with >, I would precede the > token with gt. This has the effect of looking ahead to the next token which may could start with > without being >, and try to convert that token into one > token and another token for the "rest" of the initial token.