java scala dbpedia mediawiki-templates

DBpedia extraction framework - how to strip MediaWiki formatting markup


I'm playing around with the DBpedia extraction framework. It seems very nice, and I'm happily building ASTs of Wikipedia pages and extracting links (using WikiParser). However, although I get a nicely structured tree from the parse, the text nodes still contain a lot of formatting markup (e.g. apostrophes used for italics and bold). For my purposes this markup is not helpful - I just want the plain text.

I can spend some time writing my own code to strip this out, but I'm presuming that something like this would be useful for dbpedia - and that it exists somewhere in the library. Am I right? And if so - where is the extra functionality to strip down to bare text?

Otherwise - does anyone know of any other (preferably scala) packages to strip out mediawiki markup?

Edit

In response to a request for more detail, the following markup:

''An italicised '''bit''' of text'', <b>Some markup</b>

Comes through DBpedia untouched, as the contents of a TextNode. I would like to be able either to strip it down to:

 An italicised bit of text, Some markup

Or possibly to a more structured AST with additional nodes representing each section of raw text, perhaps annotated (on each node) with the type of formatting to be applied (e.g. italics, bold etc).

As is, the end result of a dbpedia parse is still quite full of markup.

Hope that helps.


Solution

  • A quick look at the SimpleWikiParser source code on SourceForge suggests that, as of 1/29/2011, the parser handles the following entities:

    • comments
    • references
    • code blocks
    • internal links and external links
    • properties
    • tables.

    Presumably all other wiki content ends up in TextNode objects. Given the size of the wiki markup feature set, stripping out the remaining syntax elements would take a non-trivial amount of work, let alone converting them further into structured elements.

    For alternatives, or for code you can leverage, look at the Alternate Parsers page.

    For a self-contained but imperfect solution, you could run a series of regular expression replacements on node.text.
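
As a rough illustration of the regular expression approach, here is a minimal sketch in plain Java (so it can also be called from Scala on the contents of a TextNode). The class name `WikiMarkupStripper` and the method `strip` are hypothetical and not part of the DBpedia framework. It only covers the two kinds of markup from the question, apostrophe runs and simple inline HTML tags; it is not a full wikitext parser and will also remove any legitimate double apostrophes.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the DBpedia extraction framework.
final class WikiMarkupStripper {
    // Runs of two or more apostrophes: '' (italic), ''' (bold), ''''' (bold italic).
    private static final Pattern QUOTE_RUN = Pattern.compile("'{2,}");

    // Simple inline HTML tags such as <b>, </b>, <i> or <br />.
    private static final Pattern HTML_TAG =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(\\s[^<>]*)?/?>");

    private WikiMarkupStripper() {}

    // Removes the markup matched by the patterns above and keeps the enclosed text.
    static String strip(String text) {
        String result = QUOTE_RUN.matcher(text).replaceAll("");
        return HTML_TAG.matcher(result).replaceAll("");
    }

    public static void main(String[] args) {
        String markup = "''An italicised '''bit''' of text'', <b>Some markup</b>";
        // prints: An italicised bit of text, Some markup
        System.out.println(strip(markup));
    }
}
```

Single apostrophes (as in "it's") and a bare `<` that is not followed by a letter are left alone. Anything beyond these two cases would need further patterns or one of the alternative parsers.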