Recommended strategy for parsing ad-hoc if/else syntax in Java?

(Sorry, not sure if ad-hoc is the right word here ... open for a better suggestion)

I'm trying to parse the Galaxy ToolConfig XML CLI tool wrapper format in a Java app, for replicating (in part) the behaviour of the Galaxy software itself.

The format includes some "free-text" if/else clauses, inside the command tag (that's the only place they occur, AFAIK):

...
<command interpreter="python">
  sam_to_bam.py
    --input1=$source.input1
    --dbkey=${input1.metadata.dbkey} 
    #if $source.index_source == "history":
      --ref_file=$source.ref_file
    #else
      --ref_file="None"
    #end if
    --output1=$output1
    --index_dir=${GALAXY_DATA_INDEX_DIR}
</command>
...

What would be a recommended strategy for parsing this if/else structure into something that can be used to remodel the if/else logic in Java?

Is BNF/ANTLR overkill? Would it be better just to parse into some object structure, or something else? Any design patterns that would fit here? (I haven't worked with BNF/ANTLR before, but am willing to look into it if it will be worth it.)

Solution

If you want to capture all the structure of your input, a parser is the only way to go. One can hand-code a top-down recursive descent parser, but there is little point in doing that; this is why parser generator tools exist, so use them.

Regarding the #if/#else/#end if: if that's the only structure you want to capture, then you need only a pretty primitive grammar, one that also allows tokens containing arbitrary text, so that the goo between the conditional directives is picked up as blobs of text.
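As a rough illustration of how primitive that grammar can be, here is a minimal hand-written sketch in Java (16+ for records). It assumes every directive sits on a line of its own, as in the question's example, and that the directives are spelled `#if ...:`, `#else` and `#end if`; the names `CommandTemplate`, `Text` and `If` are made up for this example. A parser generator would express the same three rules more compactly.

```java
import java.util.ArrayList;
import java.util.List;

public class CommandTemplate {
    /** A node is either a blob of plain text or a conditional. */
    public interface Node {}

    public record Text(String text) implements Node {}

    public record If(String condition, List<Node> thenBranch, List<Node> elseBranch)
            implements Node {}

    private final String[] lines;
    private int pos = 0;

    private CommandTemplate(String source) {
        this.lines = source.split("\\R");
    }

    public static List<Node> parse(String source) {
        CommandTemplate parser = new CommandTemplate(source);
        List<Node> nodes = parser.parseBlock();
        if (parser.pos < parser.lines.length) {
            // A stray #else or #end if at the top level ends up here.
            throw new IllegalArgumentException("Unexpected '"
                    + parser.lines[parser.pos].trim() + "' at line " + (parser.pos + 1));
        }
        return nodes;
    }

    // block := (text | conditional)*   stops at #else, #end if, or end of input
    private List<Node> parseBlock() {
        List<Node> nodes = new ArrayList<>();
        StringBuilder blob = new StringBuilder();
        while (pos < lines.length) {
            String line = lines[pos].trim();
            if (line.equals("#else") || line.equals("#end if")) {
                break;
            }
            if (line.startsWith("#if ")) {
                flush(blob, nodes);
                nodes.add(parseIf());
            } else {
                if (!line.isEmpty()) {
                    if (blob.length() > 0) {
                        blob.append(' ');
                    }
                    blob.append(line);
                }
                pos++;
            }
        }
        flush(blob, nodes);
        return nodes;
    }

    // conditional := "#if" condition ":" block ("#else" block)? "#end if"
    private If parseIf() {
        String condition = lines[pos++].trim().substring("#if ".length()).trim();
        if (condition.endsWith(":")) {
            condition = condition.substring(0, condition.length() - 1).trim();
        }
        List<Node> thenBranch = parseBlock();
        List<Node> elseBranch = new ArrayList<>();
        if (pos < lines.length && lines[pos].trim().equals("#else")) {
            pos++;
            elseBranch = parseBlock();
        }
        if (pos >= lines.length || !lines[pos].trim().equals("#end if")) {
            throw new IllegalArgumentException("Missing '#end if' for: " + condition);
        }
        pos++;
        return new If(condition, thenBranch, elseBranch);
    }

    // Emits the text collected so far as one Text node and resets the buffer.
    private static void flush(StringBuilder blob, List<Node> nodes) {
        if (blob.length() > 0) {
            nodes.add(new Text(blob.toString()));
            blob.setLength(0);
        }
    }
}
```

Calling `CommandTemplate.parse(...)` on the command text from the question should yield a text blob, an `If` node with one text blob per arm, and a trailing text blob. Conditions are kept as raw strings here; evaluating them is a separate step.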

If you want to capture all code structure, and the conditionals are only allowed in certain places, then their existence can be simply integrated into whatever BNF you are using.

If, as I suspect, these can occur anywhere (is that what you mean by "ad hoc"? The #if follows C preprocessor style, and those conditionals can occur virtually anywhere in the input stream), then parsing the text while retaining the conditionals is presently at the bleeding edge of what state-of-the-art parsing can do. This is the standard C-preprocessing disease, and there have been no good solutions to it. Standard parser generators pretty much can't help in this case. (Hand-coded parsers don't fare any better; the same kind of solution has to be used in either case.)

One of the recent schemes to handle this (just reported as PhD research results in the last few months) is to fork the parse whenever an #if token is found, so that the #if and #else arms are parsed separately, and to join the parses when the #endif is found. You then need a way to fuse the generated subtrees, typically as ambiguous subtrees, each marked with the arm of the conditional that produced it.

If you want to get on with your life, I suggest you simply insist that these conditionals occur in well-defined places in your grammar, and put up with the occasional complaint from people who write unstructured preprocessor directives. ("You wrote crazy code? Sorry, my tool doesn't handle it.")
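One way to enforce such a restriction up front is a small pre-check that rejects anything outside the supported shape before parsing starts. The sketch below is only an assumption about what "well-defined places" could mean for this format: each directive must be alone on its line, and #if/#else/#end if must be balanced. The class name `DirectiveCheck` and the messages are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DirectiveCheck {
    /**
     * Returns null if all directives are placed as this sketch expects,
     * otherwise a short complaint that can be shown to the user.
     */
    public static String check(String source) {
        // One entry per open #if; the value records whether its #else was seen.
        Deque<Boolean> seenElse = new ArrayDeque<>();
        String[] lines = source.split("\\R");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            String where = "line " + (i + 1) + ": ";
            if (line.startsWith("#if ")) {
                seenElse.push(false);
            } else if (line.equals("#else")) {
                if (seenElse.isEmpty() || seenElse.pop()) {
                    return where + "#else without a matching #if";
                }
                seenElse.push(true);
            } else if (line.equals("#end if")) {
                if (seenElse.isEmpty()) {
                    return where + "#end if without a matching #if";
                }
                seenElse.pop();
            } else if (line.contains("#if ") || line.contains("#else")
                    || line.contains("#end if")) {
                // Deliberately strict: directives embedded in other text are refused.
                return where + "directive must be on a line of its own";
            }
        }
        return seenElse.isEmpty() ? null : "unterminated #if";
    }
}
```

This check is intentionally conservative: it would also refuse a legitimate argument that merely contains the text `#else`, which is the kind of occasional complaint the approach accepts.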