Parsing BibTeX record with Java RegEx


I have to write a simple BibTeX parser using Java regular expressions. The task is a bit simplified: every tag value is enclosed in quotation marks "", not braces {}. The catch is that {} can appear inside "".

I'm trying to cut single records out of the whole file, which I hold as one String, e.g. I want to get @book{...} as a String. The problem is that there may be no comma after the last tag, so a record can end like: author = "john"}.

I've tried @\w*\{[\s\S]*?\}, but it stops at the first } it finds, even one inside a tag value between "". There is also no guarantee that the closing } will be on a separate line; it can come directly after the last tag value (which may not end with " either, since it can be an integer).

Can you help me with this?


Solution

  • You could try the following expression as a basis: @\w+\{(?>\s*\w+\s*=\s*"[^"]*")*\}

    Explanation:

    • @\w+\{...\} would be the record, e.g. @book{...}
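    • (?>...)* means an atomic (independent), non-capturing group that can occur multiple times or not at all - this is meant to represent the tags. Being atomic, the group is not backtracked into once a tag has been matched.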
    • \s*\w+\s*=\s*"[^"]*" would mean a tag which could be preceded by whitespace (\s*). The tag's value has to be in double quotes and anything between double quotes will be consumed, even curly braces.

    Note that there may be more cases to take into account: as written, the expression allows neither a citation key after the opening brace, nor commas between tags, nor unquoted integer values. It does handle curly braces in tag values, because it consumes everything between the double quotes, and it does not match if the closing curly brace is missing. For example, it matches @book{ title="the use of { and }" author="John {curly} Johnson"} but not @book{ title="the use of { and }" author="John {curly} Johnson".
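
As a rough illustration of how the expression could be used from Java, here is a minimal sketch. The class name BibtexRecords, the helper findRecords and the EXTENDED pattern are assumptions made for this example and are not part of the answer. EXTENDED is only one possible way to also allow a citation key, commas after tags and unquoted integer values, and the sample input is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BibtexRecords {

    // The expression from the answer, unchanged:
    // quoted values only, no commas, no citation key.
    static final Pattern BASIC = Pattern.compile(
            "@\\w+\\{(?>\\s*\\w+\\s*=\\s*\"[^\"]*\")*\\}");

    // Hypothetical extension (an assumption, not from the answer).
    static final Pattern EXTENDED = Pattern.compile(
            "@\\w+\\{\\s*(?:[^\\s,=\"{}]+\\s*,)?"                 // optional citation key plus comma
            + "(?>\\s*\\w+\\s*=\\s*(?:\"[^\"]*\"|\\d+)\\s*,?)*"   // tags: quoted or integer value, optional comma
            + "\\s*\\}");                                         // closing brace of the record

    // Collects every substring of input that the given pattern matches.
    static List<String> findRecords(Pattern pattern, String input) {
        List<String> records = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            records.add(matcher.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String complete = "@book{ title=\"the use of { and }\" author=\"John {curly} Johnson\"}";
        String unclosed = "@book{ title=\"the use of { and }\" author=\"John {curly} Johnson\"";
        System.out.println(findRecords(BASIC, complete).size()); // prints 1
        System.out.println(findRecords(BASIC, unclosed).size()); // prints 0

        // Made-up sample with a key, commas, an integer value and no trailing comma.
        String file = "@book{key1,\n  title = \"a {b} c\",\n  year = 1999}\n"
                + "@article{ author = \"john\"}";
        System.out.println(findRecords(BASIC, file).size());     // prints 1 (only the @article record)
        System.out.println(findRecords(EXTENDED, file).size());  // prints 2
    }
}
```

Both patterns rely on the same idea as the answer: a quoted value is consumed as a whole by "[^"]*", so a } inside it can never be mistaken for the end of the record. Neither pattern handles escaped quotes inside a value.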