java regex parsing wiki-markup

Java: Regex to delete wiki markup of lists


I am reading a Wikipedia XML file, from which I have to remove the markup of every list item. For example, given the following string:

String text = ": definition list\n"
        + "** some list item\n"
        + "# another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";

Here, I want to delete the leading *, # and : markers, but not the : inside the Category links. The output should look like:

String outtext = "definition list\n"
        + "some list item\n"
        + "another list item\n"
        + "[[Category:1918 births]]\n"
        + "[[Category:2005 deaths]]\n"
        + "[[Category:Scottish female singers]]\n"
        + "[[Category:Billy Cotton Band Show]]\n"
        + "[[Category:Deaths from Alzheimer's disease]]\n"
        + "[[Category:People from Glasgow]]";

I am using the following code:

Pattern pattern = Pattern.compile("(^\\*+|#+|;|:)(.+)$");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String outtext = matcher.group(0);
    outtext = outtext.replaceAll("(^\\*+|#+|;|:)\\s", "");
    return(outtext);
}

This is not working. Can you please show me how I should do it?


Solution

  • This should work:

    text = text.replaceAll("(?m)^[*:#]+\\s*", "");
    

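    The important part is (?m), which turns on MULTILINE mode: ^ and $ then match at the start and end of every line, not only at the start and end of the whole input. Because the pattern is anchored with ^, the : inside the [[Category:...]] links is left alone.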

    OUTPUT:

    definition list
    some list item
    another list item
    [[Category:1918 births]]
    [[Category:2005 deaths]]
    [[Category:Scottish female singers]]
    [[Category:Billy Cotton Band Show]]
    [[Category:Deaths from Alzheimer's disease]]
    [[Category:People from Glasgow]]
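
The one-liner above can be sketched as a small runnable program, shown here to make the effect of MULTILINE mode visible. The class name ListMarkupDemo and the helper stripListMarkers are hypothetical names chosen for illustration, and precompiling the pattern with Pattern.MULTILINE (equivalent to the inline (?m) flag) is an assumption about how the code would be reused across many pages, not part of the original answer.

```java
import java.util.regex.Pattern;

public class ListMarkupDemo {
    // Pattern.MULTILINE is the flag form of the inline (?m):
    // ^ matches at the start of every line, not just the start of the input.
    private static final Pattern LIST_PREFIX =
            Pattern.compile("^[*:#]+\\s*", Pattern.MULTILINE);

    // Hypothetical helper: removes leading *, : and # markers from each line.
    public static String stripListMarkers(String text) {
        return LIST_PREFIX.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = ": definition list\n"
                + "** some list item\n"
                + "# another list item\n"
                + "[[Category:1918 births]]";

        // With MULTILINE: the marker on every line is removed,
        // and the ':' inside the Category link is untouched.
        System.out.println(stripListMarkers(text));

        System.out.println("---");

        // Without MULTILINE: ^ matches only at the very start of the input,
        // so only the first line loses its marker.
        System.out.println(text.replaceAll("^[*:#]+\\s*", ""));
    }
}
```

Two things to keep in mind with this sketch. First, \s also matches line breaks, so a line consisting of nothing but markers would be merged with the line after it; replacing \\s* with [ \\t]* should avoid that if it matters for your data. Second, the character class [*:#] does not include the ; that appeared in the pattern from the question; add it to the class if those lines should be stripped as well.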