
Remove wikitext hyperlinks via regex


There are two different kinds of wikitext hyperlinks:

[[stack]]
[[heap (memory region)|heap]]

I would like to remove the hyperlinks but keep the text:

stack
heap

Currently, I am running two phases, employing two different regular expressions:

import java.util.regex.Pattern;

public class LinkRemover
{
    private static final Pattern
    renamingLinks = Pattern.compile("\\[\\[[^\\]]+?\\|(.+?)\\]\\]");

    private static final Pattern
    simpleLinks = Pattern.compile("\\[\\[(.+?)\\]\\]");

    public static String removeLinks(String input)
    {
        String temp = renamingLinks.matcher(input).replaceAll("$1");
        return simpleLinks.matcher(temp).replaceAll("$1");
    }
}

Is there a way to "fuse" the two regular expressions into one, achieving the same result?

If you want to check your proposed solutions for correctness, here is a simple test class:

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class LinkRemoverTest
{
    @Test
    public void test()
    {
        String input = "A sheep's [[wool]] is the most widely used animal fiber, and is usually harvested by [[Sheep shearing|shearing]].";
        String expected = "A sheep's wool is the most widely used animal fiber, and is usually harvested by shearing.";
        String output = LinkRemover.removeLinks(input);
        assertEquals(expected, output);
    }
}

Solution

  • You can make the part up to and including the pipe optional:

    \\[\\[(?:[^\\]|]*\\|)?([^\\]]+)\\]\\]
    

    To make sure the match always stays between one pair of square brackets, use negated character classes instead of the dot.


    pattern details:

    \\[\\[         # literal opening square brackets
    (?:            # open a non-capturing group
        [^\\]|]*   # zero or more characters that are not a ] or a |
        \\|        # literal |
    )?             # make the group optional
    ([^\\]]+)      # capture everything up to the closing square brackets
    \\]\\]         # literal closing square brackets
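
As a minimal sketch, the fused pattern above could be dropped into a single-pass version of the question's class. The class name FusedLinkRemover, the dot-based naiveLinks pattern, and the removeLinksNaively method are hypothetical and appear in neither the question nor the answer. The naive variant is included only to illustrate why the negated character classes matter: with a lazy dot, the optional group can run past the first closing brackets to a pipe in a later link.

```java
import java.util.regex.Pattern;

public class FusedLinkRemover
{
    // Fused pattern from the answer: an optional "target|" prefix, then the captured label.
    private static final Pattern
    links = Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]+)\\]\\]");

    // Hypothetical dot-based variant, shown only for comparison.
    private static final Pattern
    naiveLinks = Pattern.compile("\\[\\[(?:.*?\\|)?(.+?)\\]\\]");

    public static String removeLinks(String input)
    {
        // One pass handles both [[label]] and [[target|label]].
        return links.matcher(input).replaceAll("$1");
    }

    public static String removeLinksNaively(String input)
    {
        return naiveLinks.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args)
    {
        String input = "[[stack]] and [[heap (memory region)|heap]]";
        System.out.println(removeLinks(input));        // prints "stack and heap"
        System.out.println(removeLinksNaively(input)); // prints "heap"
    }
}
```

In the naive variant the optional group is tried first, so `.*?` expands across "stack]] and [[heap (memory region)" until it reaches the pipe, and the whole input is replaced by the last label. The character classes `[^\\]|]` and `[^\\]]` cannot cross a closing bracket, which keeps each match inside a single link.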