
Python regular expression with wiki text


I'm trying to convert wikitext into plain text using Python regular expression substitution. There are two formatting rules for wiki links:

  • [[Name of page]]
  • [[Name of page | Text to display]]

    (http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)

Here is some text that gives me a headache.

The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.

The text above should be changed into:

The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.

The conflict between the [[ ]] and [[ | ]] syntax is my main problem. I don't need a single complex regular expression; applying multiple (maybe two) regular expression substitutions in sequence is OK.

Please enlighten me on this problem.


Solution

  • import re
    wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
    def strip_wikilinks(the_string):
        return wikilink_rx.sub(r'\1', the_string)
    

    Example: http://ideone.com/7oxuz

    Note: you may also find some MediaWiki parsers listed at http://www.mediawiki.org/wiki/Alternative_parsers.
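
To see how the pattern above handles both link forms, here is a minimal sketch that runs it on the sample sentence from the question. The function names `strip_wikilinks` and `strip_wikilinks_two_pass` are illustrative choices, not part of any library, and the two-pass variant is only one possible way to do the sequential substitution the question mentions.

```python
import re

# Same pattern as in the solution above: an optional "target|" prefix
# (non-capturing), then the visible text captured as group 1.
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')

def strip_wikilinks(text):
    """Replace [[page]] and [[page|label]] with the visible text."""
    return wikilink_rx.sub(r'\1', text)

def strip_wikilinks_two_pass(text):
    """Hypothetical alternative: piped links first, then plain links.

    The \\s* after the pipe also drops whitespace before the label,
    which the single-pattern version keeps.
    """
    text = re.sub(r'\[\[[^|\]]*\|\s*([^\]]+)\]\]', r'\1', text)
    return re.sub(r'\[\[([^\]]+)\]\]', r'\1', text)

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")

print(strip_wikilinks(sample))
print(strip_wikilinks_two_pass(sample))
```

Both calls should print the target sentence from the question. One difference worth noting: for a link written with spaces around the pipe, such as [[Name of page | Text to display]], the single pattern keeps the space after the pipe in its result, while the two-pass sketch removes it. Neither version attempts nested constructs such as image links with captions, which is where a real MediaWiki parser becomes the safer choice.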