java parsing wiki wikipedia

Wikipedia: Java library to remove Wikipedia text markup


I downloaded a Wikipedia dump and now want to remove the wiki markup from the contents of each page. I tried writing regular expressions, but there are too many markup cases to handle. I found a Python library, but I need a Java library because I want to integrate it into my own code.

Thank you.


Solution

  • Do it in two steps:

    1. let some existing tool convert the MediaWiki mark-up into plain HTML;
    2. convert the plain HTML into text.

    The following demo:

    import net.java.textilej.parser.MarkupParser;
    import net.java.textilej.parser.builder.HtmlDocumentBuilder;
    import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    import java.io.StringReader;
    import java.io.StringWriter;
    
    public class Test {
    
        public static void main(String[] args) throws Exception {
    
            String markup = "This is ''italic'' and '''that''' is bold. \n"+
                    "=Header 1=\n"+
                    "a list: \n* item A \n* item B \n* item C";
    
            // Step 1: convert the MediaWiki mark-up into an HTML fragment.
            StringWriter writer = new StringWriter();
    
            HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
            builder.setEmitAsDocument(false); // no <html>/<body> wrapper
    
            MarkupParser parser = new MarkupParser(new MediaWikiDialect());
            parser.setBuilder(builder);
            parser.parse(markup);
    
            final String html = writer.toString();
            final StringBuilder cleaned = new StringBuilder();
    
            // Step 2: parse the HTML and keep only the text nodes.
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleText(char[] data, int pos) {
                        // Separate text chunks with a space so adjacent elements do not run together.
                        cleaned.append(new String(data)).append(' ');
                    }
            };
            new ParserDelegator().parse(new StringReader(html), callback, false);
    
            System.out.println(markup);
            System.out.println("---------------------------");
            System.out.println(html);
            System.out.println("---------------------------");
            System.out.println(cleaned);
        }
    }
    

    produces:

    This is ''italic'' and '''that''' is bold. 
    =Header 1=
    a list: 
    * item A 
    * item B 
    * item C
    ---------------------------
    <p>This is <i>italic</i> and <b>that</b> is bold. </p><h1 id="Header1">Header 1</h1><p>a list: </p><ul><li>item A </li><li>item B </li><li>item C</li></ul>
    ---------------------------
    This is  italic  and  that  is bold. Header 1 a list: item A item B item C 
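
    If you only need the second step (HTML to text), it can be pulled out into a small helper that uses nothing but the JDK. The following is a minimal sketch, not a definitive implementation: the class name HtmlToText and the method toPlainText are made up for illustration, and collapsing runs of whitespace at the end is my own addition to avoid the double spaces seen in the output above.

    ```java
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    import java.io.IOException;
    import java.io.StringReader;
    import java.io.UncheckedIOException;

    // Hypothetical helper (name chosen for this example only).
    final class HtmlToText {

        // Returns the text content of an HTML fragment, with whitespace collapsed.
        static String toPlainText(String html) {
            final StringBuilder text = new StringBuilder();

            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    // A space after each chunk keeps neighbouring elements apart.
                    text.append(data).append(' ');
                }
            };

            try {
                new ParserDelegator().parse(new StringReader(html), callback, false);
            } catch (IOException e) {
                // Reading from a StringReader should not fail; rethrow unchecked just in case.
                throw new UncheckedIOException(e);
            }

            // Assumption: single spaces are preferred over the raw spacing.
            return text.toString().replaceAll("\\s+", " ").trim();
        }

        public static void main(String[] args) {
            String html = "<p>This is <i>italic</i> and <b>that</b> is bold. </p>"
                    + "<h1 id=\"Header1\">Header 1</h1><p>a list: </p>"
                    + "<ul><li>item A </li><li>item B </li><li>item C</li></ul>";
            System.out.println(toPlainText(html));
        }
    }
    ```

    Note that the space appended after every chunk also lands before punctuation that directly follows an inline element (for example "<b>bold</b>." would come out as "bold ."), so treat this as a starting point for plain-text extraction, not a faithful renderer.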
    

    Where can I download the Java packages you are importing?

    Here: Web Archive link of download.java.net/maven/2/net/java/textile-j/2.2