Cleanest data structure to use when interpreting data from neatly-structured user commands (in C++)

I would like to write a simple in-house program that parses user commands written in a language of our team's own invention (but based closely on another program we are already familiar with). The command parser that I am working on now will simply be the UI through which the user can run the other algorithms I have already written. (Those other algorithms, by the way, are used to generate the input files for a molecular dynamic simulation package called LAMMPS.) The only thing I really have left to do is just write this UI, but as it turns out, writing your own scripting language is almost an intractable challenge for a non software engineer to tackle on his own.

According to the answers I received, what I am try to make would be considered a Domain Specific Language, and it is not advisable to try to make one's own DSL due to the enormous amount of work required to make it useful and bug-free.

The best option then would actually be to use an existing scripting language like Lua or Python, and embed it in the program.

To do this, I will most likely use Lua because it seems most fitting for our needs. So at this point, the rest of this question is no longer relevant since the answer would be: "Don't do it yourself." But I'm still going to keep part of it here for other users to be able read and learn from the wonderful answers below.

Thanks again to everyone who replied!

Old Question:

I would like to write a program that parses a user text input and then runs a function corresponding to that input. To do this I would need to parse the string for relevant keywords. I believe there will be less than 15 keywords when I'm done, so ideally I'd like this code to be simple and short.

The problem is that I am currently using if-statements to parse the strings. This is an extremely inconvenient way to parse commands because even for a short 3 word commands the code explodes into nested-ifs 3 layers deep. So longer 8+ word sentences will become nested-ifs more than 8 layers deep.

This kind of programing approach quickly becomes unmanageable, especially when I need to make any significant changes to a command.

My question is whether or not there exists a data structure in C++ that can help me better manage my giant nested-ifs, or if anyone could suggest a better way to parse a string for lots of different data types (i.e. substings, ints, and floats) and output an error message when the expected type is not found?

Here is an example of a short user session to show the kinds of commands I would like to interpret:
    load "Basis.Silicon" as material 1
    add material 1 to layer 1
    rotate layer 1 about x-axis by 45 degrees
    translate layer 1 in x-axis by 10 nm
    generate crystal
These commands are based on an already-existing program that our team uses, but unfortunately the source code for this program has never been publicly released so I am left guessing as to how it was actually implemented.

One final note, unlike natural language processors, I know exactly what the format of each line will be. So my issue isn't so much how to interpret the text, but rather how to code the logic in a concise and manageable way.

Thanks everyone!

Solution

Your question is not clear. And your goals are more difficult than what you believe.

Either you consider that you want to somehow process human language sentences (e.g. in English). Then you want to study natural language processing, and you can find some libraries related to that field.

Or you consider that you want to interpret some formal programming or scripting language. Then you want to study interpreters and compilers. BTW, in that case, you might just embed an existing interpreter (like Lua, Guile, Python, etc....) in your program.

You could also think in terms of expert systems with a knowledge base made of rules (this approach could be viewed as in the middle between NLP and scripting language) You'll then need some inference engine (perhaps CLIPS). See also J.Pitrat's blog.

Notice that even coding a simple interpreter is more difficult than you believe. You absolutely need to represent abstract syntax trees, which you construct from textual input with a parsing phase.

BTW, All of NLP, expert systems, and interpreter design and implementation are difficult fields. You could get a PhD in all 3 fields (but you have to choose which).

If you go the embedded interpreter way: study the interpreters I mentioned (Guile, Lua, Python, Neko, etc...) and choose which one you want, to embed.

If for whatever reason, you want to make an interpreter from scratch: Learn several programming languages first (including scripting languages like Ruby, Python, Ocaml, Scheme, Lua, Neko, ...). Read books on Programming Language Pragmatics (by M.Scott) and Lisp In Small Pieces (by Queinnec). Read also text books on compilation and parsing, and on Garbage Collection and formal (e.g. denotational) semantics. All this may need a dozen years of work.

Notice that by experience embedding a software in an interpreter is a very structuring design. If you did not thought of that at the beginning you probably need to redesign and refactor a lot your existing application. For instance, when embedding a software in an interpreter, you cannot afford that bad input crashes the program. So error handling and memory management (interfacing to the GC of the interpreter) is challenging and gives new constraints. Hence you'll need to re-think your application.

If all this is new (and even if you don't choose e.g. Guile as the embedding interpreter): learn and practice a bit of Scheme -e.g. with Guile or PltScheme- (e.g. reading SICP), read a little bit about λ-calculus and closures, then read Queinnec's Lisp In Small Pieces book. Remember the halting problem (which is partly why interpreters are difficult to code).

BTW the syntax you are proposing (e.g. rotate mat 1 by x 90) is not very readable and looks COBOL-like. If possible, have a language which looks familiar to existing ones. Make it easy to read !

^{Start by reading all the wikipages I am referencing here.}

FWIW, I am the main author of MELT, a domain specific language (inspired a lot by Scheme) to extend the GCC compiler. Some of the papers / documentations I wrote might inspire you (and contain valuable references).

Addenda (after question was reformulated)

You seems to invent some formal syntax like

add material 1 to layer 1
rotate layer 1 about x-axis by 90 degrees
translate layer 1 in x-axis by 10 inches

I can't guess what kind of language is it? Are you implementing a 3D printer? If yes, you should stick to some existing standard formal language in that domain.

I believe that such a COBOL-like syntax is really wrong. The point is that it is too verbose, and that you are wishing to implement some domain specific language. I find your example very bad-looking.

Is that syntax your invention, or is there some document specifying (and many thousands already existing lines coded in) your domain specific language. If you are just inventing it, please reconsider the syntax and the semantics. First, you need to specify on paper the full syntax and semantics of your DSL.

Is your DSL Turing complete? (I guess that yes, because Turing completeness is reached very quickly - e.g. with variables and loops....). If yes, you are inventing a scripting language. Please don't invent scripting language without knowing several programming & scripting languages (then read Programming Language Pragmatics...). The point is that, if your scripting language will become successful, advanced users will soon or later write important programs in it (e.g. many thousand lines). Then, these advanced users will be programmers. In that case, it is very important (for social & economic reasons) to have a DSL well founded and looking familiar (if possible, an extension of some existing scripting language).

If your DSL already exists, stick to its specification on paper. If that specification is not good enough, improve it with formalization (e.g. by writing some BNF syntax, and some formal (e.g. denotational) semantics for it). Publish and discuss that formalization with existing users.

Several industries got some ad-hoc DSLs which became widely used but was ill designed (e.g., in the French nuclear industry, the Gibiane DSL designed in the 1970s by nuclear physicists, not computer scientists; the US Boeing corporation is also rumored to have made similar mistakes). Then, maintaining and improving the many hundred thousands lines of DSL scripts is becoming a nightmare (and may means losing millions of dollars or euros). So you better stick to some existing scripting language. The advantages are that there exist some culture on it (e.g. you can find dozens of books on Python or Lua, and many trained engineers familiar with them), that the interpreter is widely used and tested, that the community working on them is improving the interpreters, so it has quite few uncorrected bugs.

You should not attempt to design and implement your own DSL if you are not a trained computer scientist. Stick to some existing scripting language (of course their syntax is not like you want it to be), and leverage on existing implementations and experiment.

As a counter-example, J.Ousterhout has invented the widely used Tcl scripting language, with the claim that scripts are always small (e.g. hundreds of line only) and won't grow to big code base; unfortunately, some of them did, and Tcl is known as a bad language to code many dozens of thousands of lines (even if Tcl is an easy and convenient language for tiny scripts). The moral of the story is that if a (turing complete) scripting language is becoming successful, some "crazy" advanced user will code hundred of thousands of script code. So you need that scripting language to be well designed from the start. Hence, you should adopt and adapt a good existing scripting language (and avoid inventing an unfamiliar syntax without having a good knowledge of several existing scripting languages)

later additions

^{PS: my criticism of Tcl is not entirely subjective: the point is that Tcl was designed for small scripts in mind (read J.Ousterhout's first papers about Tcl), but my point is that when you offer a Turing-complete scripting language, some "crazy" user will eventually write huge scripts for it. Hence, you need to anticipate such "crazy" usage by offering a scripting language which "scales up" to big scripts, so is built according to software engineering practices for large software code base.}

NB. Lua is probably a good choice as a language to embed. It is small, has a nice implementation, is well documented, and has good performance. But be careful about memory management issues (and this advice holds for any scripting language).