I'm trying to come up with a way to process about a million formal documents (for argument's sake, say they are theses). They are not all standardized, but they are close enough: each consists of titles, sections, paragraphs, etc. Subtle differences crop up, though. For example, in English the title is labelled "Title", but in French it is "Titre".
Thus, in my mind, the best way to do this would be to create an EBNF grammar that covers all the possible variants, for instance Title := "Title" | "Titre".
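To make the idea concrete, here is a minimal sketch of what such a grammar might look like in EBNF-style notation. Only the "Title" / "Titre" alternation comes from the documents themselves. The rule names (`document`, `section`, `paragraph`, `text_line`) and the "Section" keyword are assumptions for illustration.

```ebnf
(* Hypothetical sketch: rule names and overall structure are assumptions. *)
document        = title, { section } ;
title           = title_keyword, text_line ;
title_keyword   = "Title" | "Titre" ;        (* one alternative per language *)
section         = section_keyword, text_line, { paragraph } ;
section_keyword = "Section" ;                (* assumed keyword *)
paragraph       = text_line, { text_line } ;
text_line       = ? any run of characters up to the end of the line ? ;
```

Each language variant is just another alternative in the keyword rules, so the structural rules stay the same for every language.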
I'm not too concerned with coming up with the EBNF. My main concern is how to do the parsing. I've looked at ANTLR, OSLO, Irony and a slew of others, but I don't have the expertise in them to judge whether any of them would suit my task.
So, my question to the learned among you is: which of these tools, if any, would be a good fit for this task?
My development platform of choice is C#. I mention this because ideally I would like to integrate the DSL tool into code so that we can work with it from existing apps.
I came across a tool called TinyPG. It's not exactly what I need, but having the source code to look at will let me build what I need.