c++ yacc flex-lexer analyzer lexical

Lexical and syntax analyzer software


I am designing a custom CSS-like language (CSS plus custom extensions), which would basically work like this:

[object.member.value = 5]{
object.member.anothervalue:8 
object.member.yetanothervalue:'hello'
object.member.yetyetanothervalue.anothervalue:blue
}

Basically, the language allows checking some conditions (ifs, which can be nested) and then applying some values to the object. There are no loops. This would be stored in plain text files and loaded into the application (C++) at startup.

The idea is to translate this CSS-ish file into a C++ tree or something similar, which can be evaluated at runtime.
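To make the goal concrete, here is a minimal sketch of what such a runtime tree might look like. Everything in it is an assumption made for illustration: the names Properties, Condition, Assignment and Rule are hypothetical, the application's objects are replaced by a flat string map, and only equality conditions are covered.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Flat property store standing in for the application's objects (hypothetical).
using Properties = std::map<std::string, std::string>;

// One "[path = value]" test; only equality is sketched here.
struct Condition {
    std::string path;
    std::string expected;
    bool holds(const Properties& props) const {
        auto it = props.find(path);
        return it != props.end() && it->second == expected;
    }
};

// One "path:value" line inside a block.
struct Assignment {
    std::string path;
    std::string value;
};

// A block: a condition, the assignments it guards, and any nested blocks.
struct Rule {
    Condition condition;
    std::vector<Assignment> assignments;
    std::vector<std::unique_ptr<Rule>> children;

    // Apply this rule, then its nested rules, only if the condition holds.
    void apply(Properties& props) const {
        if (!condition.holds(props)) return;
        for (const auto& a : assignments) props[a.path] = a.value;
        for (const auto& child : children) child->apply(props);
    }
};

// Build the tree for the example block above by hand; a parser would
// construct the same structure from the text file.
std::unique_ptr<Rule> make_example_rule() {
    auto rule = std::make_unique<Rule>();
    rule->condition = {"object.member.value", "5"};
    rule->assignments = {
        {"object.member.anothervalue", "8"},
        {"object.member.yetanothervalue", "hello"},
        {"object.member.yetyetanothervalue.anothervalue", "blue"},
    };
    return rule;
}
```

Calling make_example_rule()->apply(props) on a store where object.member.value is "5" sets the three guarded properties; on any other store it changes nothing. Nested ifs fall out naturally, since each child rule is only visited when its parent's condition held.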

I am considering using a lexical analyzer generator and a parser generator (Yacc, Flex, Bison, etc.).

What would be your suggestion of tools / libraries to use?


Solution

  • If you expect to do this sort of thing more than once, learn how to use parser generators. It will save you a lot of pain in the long run.

    Start simple. The tools will do lots of things for you, and generally with very little effort. Let them do that. Get things working before you try to do things which are complex.

    The rest of this assumes that you will use flex and bison (which are lex and yacc lookalikes). You don't have to; there are many alternatives. If you decide to try one of them, ignore the rest of this answer.

    But flex and bison are solid, well-maintained, well-debugged packages with a lot of documentation, and they've been used extensively over a long period of time. Read the documentation first.

    • flex will read from standard input, or from any FILE* you assign to yyin, automatically. Let it do that.
    • flex will track line numbers for you (put %option yylineno in your flex file). Let it do that.
    • bison will generate token numbers for you automatically. Let it do that.
    • bison and flex make single-character tokens especially easy: bison accepts character literals such as '{' directly in grammar rules. Not only do you not need to provide token numbers, you don't even need to provide token names. In your flex file, just put this at or near the end:

      . { return yytext[0]; }
      

      and don't bother writing rules to handle single-character tokens. Don't worry about the fact that you will tokenize illegal characters; bison will produce an error message for you.

    • However, don't allow flex to insert a default rule. (%option nodefault is enough to suppress it.)

    A couple of other tips:

    • Even though yytext is a global, pretend that it isn't. It points into flex's internal buffer, which is overwritten when the next token is scanned, so you must copy any string which is needed for further processing. strdup is your friend; use it rather than messing around with malloc and strcpy. Use asprintf as well; char* out; asprintf(&out, "%s%s%s", s1, s2, s3); is far and away the easiest way to concatenate three strings. There are easily available unrestricted implementations for platforms which don't have these functions, so don't worry about the "but they're not Posix/Standard C/yadda yadda yadda" arguments. And don't even think about fixed-length buffers. You don't need them. Honest.
    • On the other hand, if a token can be processed in the scanner, do it there. Numbers, for example; it's much easier to do the strtol once in the scanner, and then you don't even need to think about memory allocations.
    • Don't forget to free() strings when you don't need them any more, but if you find that difficult start by leaking memory and then fix things after you have your parser working. (I know some people will find that sacrilegious, but as long as you remember to do it before production, it's fine; you'll feel a lot more motivated once you have the basics working.)
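The three tips above can be sketched outside of flex. In this sketch token_buffer is only a stand-in for yytext, and fake_scan, save_token, token_as_long and concat3 are hypothetical helpers invented for illustration; strdup, strtol and asprintf are the real library calls (asprintf being a GNU/BSD extension, which is why _GNU_SOURCE is defined first).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // asprintf is a GNU/BSD extension, not standard C or C++
#endif
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Stand-in for flex's yytext: a buffer the scanner overwrites on every token.
static char token_buffer[64];

// Hypothetical helper: pretend the scanner just matched `text`.
void fake_scan(const char* text) {
    std::snprintf(token_buffer, sizeof token_buffer, "%s", text);
}

// Copy the current token so it survives the next scan; caller must free().
char* save_token() {
    return strdup(token_buffer);
}

// Convert a numeric token where it is scanned; no allocation needed.
// Returns false on overflow, on an empty token, or on trailing junk.
bool token_as_long(long* out) {
    errno = 0;
    char* end = nullptr;
    long value = std::strtol(token_buffer, &end, 10);
    if (errno == ERANGE || end == token_buffer || *end != '\0') return false;
    *out = value;
    return true;
}

// Join three strings into freshly allocated storage; caller must free().
// Returns nullptr if asprintf reports failure.
char* concat3(const char* s1, const char* s2, const char* s3) {
    char* out = nullptr;
    if (asprintf(&out, "%s%s%s", s1, s2, s3) < 0) return nullptr;
    return out;
}
```

A pointer returned by save_token() stays valid after the next fake_scan(), whereas a pointer into token_buffer would silently change underneath you; that is the same hazard as holding on to yytext across tokens.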

    And finally:

    • Use a reasonably current version of bison. If you find yourself with mysterious shift/reduce conflicts, use a GLR parser (%glr-parser): yes, it's a bit slower, but if it saves you some pain, it's worth it. You can always go back and fix things up later. (GLR parsers won't save you from all grammar problems. You still need to make sure your grammar isn't ambiguous. But they can help.)
    • My personal recommendation: Use the C interfaces. It's ok to compile with C++ and you can use standard C++ containers and other nice features; just don't use them in your semantic values because that doesn't play nicely with bison's internal stack management. (Pointers to C++ containers are just fine, though.) And remember that flex and bison are just control flow; the bulk of your program is going to be written in C/C++, so you're not entering a new world by using the compiler tools. You're also not getting a free pass: you need to know how to use C/C++ before you start writing your parser.
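The point about semantic values can be illustrated with a small sketch. SemanticValue below is a hypothetical type shaped like what a %union declaration might produce, and new_path, append_segment, relocate and free_path are invented helpers. The assumption being illustrated is that a C-style parser stack may move its entries with a plain byte copy, which is only safe for types that are trivially copyable.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical semantic value type: scalars and pointers only.
union SemanticValue {
    long number;                     // already converted in the scanner
    char* text;                      // a strdup'd copy of the token text
    std::vector<std::string>* path;  // pointer to a C++ container: fine
};

// Holding only scalars and pointers, the union can be moved as raw bytes.
static_assert(std::is_trivially_copyable<SemanticValue>::value,
              "semantic values must survive a byte-wise copy");

// The container itself cannot, which is why it is held by pointer.
static_assert(!std::is_trivially_copyable<std::vector<std::string>>::value,
              "std::vector is not safe to copy as raw bytes");

// Hypothetical action helper: start a dotted path such as object.member.
SemanticValue new_path(const char* first) {
    SemanticValue v;
    v.path = new std::vector<std::string>{first};
    return v;
}

// Hypothetical action helper: add one more segment to an existing path.
void append_segment(SemanticValue v, const char* segment) {
    v.path->push_back(segment);
}

// Mimic a stack relocation done with a plain byte copy.
SemanticValue relocate(const SemanticValue& v) {
    SemanticValue moved;
    std::memcpy(&moved, &v, sizeof v);
    return moved;
}

// The container is owned through the pointer, so it must be deleted once.
void free_path(SemanticValue v) {
    delete v.path;
}
```

After relocate(), the copy points at the same vector as the original, so the path is intact and there is still exactly one object to delete. Doing the same byte copy on a std::vector held by value would duplicate its internal pointers without the container knowing.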

    Hope that helps. Good luck.