java regex parsing wikitext

Java: Regex to delete parts of an XML file


I am reading a Wikipedia XML file, in which I have to delete everything between double curly braces ({{...}}). For example, given the following string:

String text = "{{Use dmy dates|date=November 2012}} {{Infobox musical artist <!-- See Wikipedia:WikiProject_Musicians --> | name
= Russ Conway | image = | caption = Russ Conway, pictured on the front of his 1959 [[Extended play|EP]] ''More Party Pops''. | image_size = | background = non_vocal_instrumentalist | birth_name = Trevor Herbert Stanford | alias = | birth_date = {{birth date|1925|09|2|df=y}} | birth_place = [[Bristol]], [[England]], UK | death_date = {{death date and age|2000|11|16|1925|09|02|df=y}} | death_place = [[Eastbourne]], [[Sussex]], England, UK | origin = | instrument = [[Piano]] | genre = | occupation = [[Musician]] | years_active = | label = EMI (Columbia), Pye, MusicMedia, Churchill | associated_acts = | website = | notable_instruments = }}";

It should be replaced with an empty string. Note that the example contains multiple newlines and nested {{...}}.

I am using the following code:

Pattern p1 = Pattern.compile(".*\\({\\{.+\\}\\}).*", Pattern.DOTALL);
Matcher m1 = p1.matcher(text);

while(m1.find()){

String text1 = text.replaceAll(m1.group(1), "");
}

I am new to regex. Can you please tell me what I am doing wrong?


Solution

  • This is not generally possible with a regular expression. Regular languages cannot describe arbitrary levels of nesting, because they have no way to "count" what level they're at.

    If you absolutely must use regex, you could write an expression that works up to a fixed depth, e.g. three levels of nesting, by encoding all the nesting possibilities manually. But this would be extremely cumbersome, would effectively violate DRY, and is nowhere near the right tool for the job.

    It would likely be easier to do this "by hand", if needs be. Scan across the string yourself, and every time you hit a {{ increase the "brace level"; every time you hit a }} decrease it. Copy each character to the output if and only if the brace level is zero.

    Something like (untested):

    StringBuilder output = new StringBuilder();
    char[] input = text.toCharArray();
    int braceLevel = 0;
    for (int i = 0; i < input.length; i++) {
       final char c = input[i];
       if (c == '{') {
          // Check for {{
          if (i < input.length - 1 && input[i+1] == '{') {
             // Yep, it's a double brace - increase the level, consume
             // the second character and continue with the next char
             braceLevel++;
             i++;
             continue;
          }
       }
       else if (c == '}' && braceLevel > 0) {
          // Check for a closing brace similar to above
          if (i < input.length - 1 && input[i+1] == '}') {
             braceLevel--;
             i++;
             continue;
          }
       }
    
       if (braceLevel == 0) {
          output.append(c);
       }
    }
    
    // Now output contains every character that was not inside a {{...}} block
    String result = output.toString();
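
For comparison, here is a rough sketch of the fixed-depth regex idea mentioned earlier in this answer. It is only an illustration, not a recommendation: the class name `FixedDepthBraceStripper` and the methods `buildPattern` and `strip` are made up for this example, and the pattern is assembled in a loop rather than typed out by hand. It should be fine for short inputs; on very long templates Java's backtracking regex engine may well run into stack depth problems, which is one more reason to prefer the hand-written scan.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
final class FixedDepthBraceStripper {

    // One character that does not start a "{{" or "}}" delimiter.
    private static final String ATOM = "(?:(?!\\{\\{|\\}\\}).)";

    /** Builds a pattern that matches {{...}} blocks nested at most maxDepth levels deep. */
    static Pattern buildPattern(int maxDepth) {
        if (maxDepth < 1) {
            throw new IllegalArgumentException("maxDepth must be >= 1");
        }
        // Depth 1: a block that contains no further {{...}} blocks.
        String block = "\\{\\{" + ATOM + "*\\}\\}";
        // Each extra level allows blocks of the previous depth inside the new one.
        for (int depth = 2; depth <= maxDepth; depth++) {
            block = "\\{\\{(?:" + ATOM + "|" + block + ")*\\}\\}";
        }
        // DOTALL so that '.' also matches the newlines inside a template.
        return Pattern.compile(block, Pattern.DOTALL);
    }

    /** Removes every {{...}} block that is nested no deeper than maxDepth. */
    static String strip(String text, int maxDepth) {
        return buildPattern(maxDepth).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String nested = "a{{x{{y}}z}}b";
        System.out.println(strip(nested, 2)); // prints "ab"
        System.out.println(strip(nested, 1)); // prints "a{{xz}}b"
    }
}
```

The second call shows the limitation: with a depth limit of 1 only the innermost block is removed and the outer braces are left behind. Any fixed limit fails in the same way as soon as the input nests one level deeper, whereas the brace-level counter above has no such limit.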