I want to extract the Infobox blocks from Wikipedia page text. Below is a sample input file:
{{some text}}
some other text
{{Infobox President
birth|d/m/y
other_inner_text:{{may contain curly bracket}}
other text}}
some other text
or even another infobox
{{Infobox Cabinet
same structure
{{text}}also can contain {{}}
}}
can be some other text...
I want the parsing result to return the two Infobox blocks:
{{Infobox President
birth|d/m/y
other_inner_text:{{may contain curly bracket}}
other text
}}
and
{{Infobox Cabinet
same structure
{{text}}also can contain {{}}
}}
Does anyone know how to use a regular expression in Python to achieve this?
Regex (the "dot matches newline" option must be on, e.g. re.DOTALL in Python or /s in Perl):
{{Infobox(?:(?!}}|{{).)*(?:{{(?:(?!}}|{{).)*}}(?:(?!}}|{{).)*)*.*?}}
And my attempt in Perl, which I'm not fluent in:
while ($subject =~ m/\{\{Infobox(?:(?!\}\}|\{\{).)*(?:\{\{(?:(?!\}\}|\{\{).)*\}\}(?:(?!\}\}|\{\{).)*)*.*?\}\}/sg) {
# matched text = $&
}
It works with any number of "{{ some text }}" pairs inside the Infobox, as long as they are balanced. It does not support further pairs nested inside such a pair, but that wasn't required.
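Since the question asked for Python, here is a rough port of the same pattern. It is only a sketch: the names INFOBOX_RE, sample and blocks are mine, and it assumes the whole page text is already in a string. The braces are escaped as in the Perl version, and re.DOTALL plays the role of /s.

```python
import re

# Same pattern as above, with the braces escaped.
# re.DOTALL lets "." match newlines, like Perl's /s flag.
INFOBOX_RE = re.compile(
    r"\{\{Infobox(?:(?!\}\}|\{\{).)*"                      # header and text up to the first brace pair
    r"(?:\{\{(?:(?!\}\}|\{\{).)*\}\}(?:(?!\}\}|\{\{).)*)*"  # any number of flat {{...}} pairs plus trailing text
    r".*?\}\}",                                            # closing braces of the Infobox
    re.DOTALL,
)

# Sample input taken from the question.
sample = """{{some text}}
some other text
{{Infobox President
birth|d/m/y
other_inner_text:{{may contain curly bracket}}
other text}}
some other text
or even another infobox
{{Infobox Cabinet
same structure
{{text}}also can contain {{}}
}}
can be some other text...
"""

# No capturing groups in the pattern, so findall returns the full matches.
blocks = INFOBOX_RE.findall(sample)
for block in blocks:
    print(block)
    print("-----")
```

As with the Perl version, this only handles one level of "{{...}}" inside the Infobox.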
Note that if this is going to be used for more than a one-off job, it's probably better to look for an alternative solution. Maintaining such a regex is brutal.
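One possible alternative, as a sketch only: skip the regex and count brace depth by hand. This is not the approach used above; the function name extract_infoboxes and its marker parameter are made up for illustration, and it assumes "{{" and "}}" are the only delimiters that matter. Unlike the regex, it also copes with pairs nested to any depth.

```python
def extract_infoboxes(text, marker="{{Infobox"):
    """Return every balanced {{Infobox ...}} block found in text."""
    blocks = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        depth = 0
        i = start
        end = None
        # Walk forward, tracking how many "{{" are currently open.
        while i < len(text) - 1:
            pair = text[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    end = i  # index just past the closing "}}"
                    break
            else:
                i += 1
        if end is None:
            break  # unbalanced block: stop rather than guess
        blocks.append(text[start:end])
        pos = end
    return blocks


nested = "x {{Infobox A\n{{b {{c}} d}} e}} tail {{Infobox B}}"
result = extract_infoboxes(nested)
for block in result:
    print(block)
```

It is longer than the one-liner, but each step is easy to read and change later.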