java regex tokenize

Tokenize words separated by non-word characters, except single quotes


I'm trying to implement a method that parses the input into “word tokens”: sequences of word characters separated by non-word characters. However, non-word characters can become part of a token if they are quoted (in single quotes).
I want to use a regex but am having trouble getting my code just right:

public static List<String> wordTokenize(String input) {
    Pattern pattern = Pattern.compile("\\b(?:(?<=\')[^\']*(?=\')|\\w+)\\b");
    Matcher matcher = pattern.matcher(input);
    List<String> ans = new ArrayList<>();
    while (matcher.find()) {
        ans.add(matcher.group());
    }
    return ans;
}

My regex fails to recognize that a quote appearing mid-word, with no space before it, does not start a new token; the quoted text should be joined to the word it touches. Examples:

  1. The input: "this-string 'has only three tokens'" // works

  2. The input: "this*string'has only two@tokens'"
    Expected :[this, stringhas only two@tokens]
    Actual :[this, string, has only two@tokens]

  3. The input: "one'two''three' '' four 'twenty-one'"
    Expected :[onetwothree, , four, twenty-one]
    Actual :[one, two, three, four, twenty-one]

How do I fix the spaces?


Solution

  • You want to match one or more consecutive occurrences of either a word character or a substring enclosed in the closest pair of straight single quotes, and then remove all those quotes from the tokens.

    Use the following regex and .replace("'", "") on the matches:

    (?:\w|'[^']*')+
    

    Details:

    • (?: - start of a non-capturing group
      • \w - a word char
      • | - or
      • ' - a straight single quotation mark
      • [^']* - any 0+ chars other than a straight single quotation mark
      • ' - a straight single quotation mark
    • )+ - end of the group, 1+ occurrences.
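    To see what the pattern captures before the apostrophes are removed, it can help to print the raw matches. The following is only a minimal sketch: the class name RawMatchDemo and the helper rawMatches are made up for illustration, and the Unicode flag is left out to keep the focus on the pattern itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawMatchDemo {
    // Same pattern as above: a run of word chars and/or '...'-quoted substrings
    private static final Pattern TOKEN = Pattern.compile("(?:\\w|'[^']*')+");

    // Hypothetical helper: returns each match exactly as found, quotes included
    static List<String> rawMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rawMatches("this*string'has only two@tokens'"));
        // [this, string'has only two@tokens']
        System.out.println(rawMatches("one'two''three' '' four 'twenty-one'"));
        // [one'two''three', '', four, 'twenty-one']
    }
}
```

    Because a quoted substring is just another alternative inside the repeated group, it is glued onto any adjacent word characters, and an empty pair of quotes still counts as a match of its own.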

    See the Java demo:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // String s = "this*string'has only two@tokens'"; // => [this, stringhas only two@tokens]
    String s = "one'two''three' '' four 'twenty-one'"; // => [onetwothree, , four, twenty-one]
    Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);
    Matcher matcher = pattern.matcher(s);
    List<String> tokens = new ArrayList<>();
    while (matcher.find()) {
        // Strip the quoting apostrophes from each match
        tokens.add(matcher.group(0).replace("'", ""));
    }
    System.out.println(tokens);
    

    Note that Pattern.UNICODE_CHARACTER_CLASS is added so that \w matches all Unicode letters and digits, not just ASCII ones.
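
    Put together, the approach might be dropped into the method signature from the question roughly as follows. This is a sketch rather than a definitive implementation: the class name WordTokenizer and the TOKEN constant are assumptions added for illustration, and the last input is an extra example that does not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordTokenizer {
    // Compiled once; UNICODE_CHARACTER_CLASS lets \w match non-ASCII letters and digits
    private static final Pattern TOKEN =
            Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);

    public static List<String> wordTokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            // Drop the quoting apostrophes from the matched text
            tokens.add(matcher.group().replace("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokenize("this-string 'has only three tokens'"));
        // [this, string, has only three tokens]
        System.out.println(wordTokenize("this*string'has only two@tokens'"));
        // [this, stringhas only two@tokens]
        System.out.println(wordTokenize("one'two''three' '' four 'twenty-one'"));
        // [onetwothree, , four, twenty-one]

        // Extra input (not from the question): "café" written with a Unicode escape.
        // Without the flag, \w is ASCII-only and the token would stop at "caf".
        System.out.println(wordTokenize("caf\u00E9").size()); // 1
    }
}
```

    One behavior worth knowing: an unpaired apostrophe cannot match either alternative, so it is skipped and splits the token. For instance, "don't" comes out as [don, t] with this pattern.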