
Parsing wikiText with regex in Java


Given a wikiText string such as:

{{ValueDescription
    |key=highway
    |value=secondary
    |image=Image:Meyenburg-L134.jpg
    |description=A highway linking large towns.
    |onNode=no
    |onWay=yes
    |onArea=no
    |combination=
    * {{Tag|name}}
    * {{Tag|ref}}
    |implies=
    * {{Tag|motorcar||yes}}
    }}

I'd like to parse the ValueDescription and Tag templates in Java/Groovy. I tried the regex /\{\{\s*Tag(.+)\}\}/ and it works (it returns |name, |ref and |motorcar||yes), but /\{\{\s*ValueDescription(.+)\}\}/ doesn't match (it should return all the text above).

The expected output

Is there a way to skip nested templates in the regex?

Ideally I would rather use a simple wikiText-to-XML tool, but I couldn't find one.

Thanks! Mulone


Solution

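  • Compile the pattern with the Pattern.DOTALL flag. By default . does not match line terminators, so (.+) can never span the multi-line ValueDescription template; with DOTALL it can: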

    Pattern p = Pattern.compile("\\{\\{\\s*ValueDescription(.+)\\}\\}", Pattern.DOTALL);
    

    Sample Code:

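    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    // str holds the wikiText from the question
    Pattern p = Pattern.compile("\\{\\{\\s*ValueDescription(.+)\\}\\}", Pattern.DOTALL);
    Matcher m = p.matcher(str);
    while (m.find()) {
        System.out.println("Matched: [" + m.group(1) + ']');
    }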
    

    OUTPUT

    Matched: [
    |key=highway
    |value=secondary
    |image=Image:Meyenburg-L134.jpg
    |description=A highway linking large towns.
    |onNode=no
    |onWay=yes
    |onArea=no
    |combination=
    * {{Tag|name}}
    * {{Tag|ref}}
    |implies=
    * {{Tag|motorcar||yes}}
    ]
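
Since the question also asks for the Tag templates, the captured body can be scanned a second time. The following is a minimal sketch, not part of the original answer: the class name TagExtractor and the method extractTagArgs are hypothetical, and it assumes that a Tag template never contains a } character or a nested template. It uses [^}]* instead of (.+) so that each Tag is matched on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {
    // [^}]* stops at the first closing brace, so every {{Tag|...}} is matched separately.
    // Assumption: Tag arguments contain no '}' and no nested templates.
    private static final Pattern TAG =
            Pattern.compile("\\{\\{\\s*Tag\\|([^}]*)\\}\\}");

    // Returns the argument string of every Tag template found in body.
    static List<String> extractTagArgs(String body) {
        List<String> args = new ArrayList<>();
        Matcher m = TAG.matcher(body);
        while (m.find()) {
            args.add(m.group(1));
        }
        return args;
    }

    public static void main(String[] argv) {
        String body = "|combination=\n* {{Tag|name}}\n* {{Tag|ref}}\n"
                    + "|implies=\n* {{Tag|motorcar||yes}}\n";
        System.out.println(extractTagArgs(body)); // prints [name, ref, motorcar||yes]
    }
}
```

Unlike the regex in the question, the leading | is left out of the captured group; splitting each result on | then yields the individual Tag arguments.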
    

    Update

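    Assuming the closing }} of each {{ValueDescription template appears at the start of its own line (directly after a line break), the following pattern, which uses a non-greedy (.+?), will capture multiple ValueDescription templates separately: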

    Pattern p = Pattern.compile("\\{\\{\\s*ValueDescription(.+?)\n\\}\\}", Pattern.DOTALL);
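
To illustrate the non-greedy pattern on input with more than one template, here is a small self-contained sketch. The class name ValueDescriptionSplitter, the method extractBodies and the sample input are made up for this example. It also adds [ \t]* before the closing braces, which is an assumption beyond the pattern above, so that an indented closing }} is accepted too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValueDescriptionSplitter {
    // Lazy (.+?) ends at the first "}}" that begins a line.
    // [ \t]* is an added assumption: it tolerates an indented closing "}}".
    private static final Pattern BLOCK = Pattern.compile(
            "\\{\\{\\s*ValueDescription(.+?)\\n[ \\t]*\\}\\}", Pattern.DOTALL);

    // Returns the body of every ValueDescription template in wikiText.
    static List<String> extractBodies(String wikiText) {
        List<String> bodies = new ArrayList<>();
        Matcher m = BLOCK.matcher(wikiText);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] argv) {
        String wikiText =
              "{{ValueDescription\n|key=highway\n|value=secondary\n"
            + "|combination=\n* {{Tag|name}}\n}}\n"
            + "{{ValueDescription\n|key=highway\n|value=primary\n}}\n";
        for (String body : extractBodies(wikiText)) {
            System.out.println("Matched: [" + body + "]");
        }
    }
}
```

This still relies on layout rather than on real brace matching: a nested template whose own closing }} sits at the start of a line would end the match early. Input like that would need a parser that counts opening and closing braces instead of a regex.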