Tokenize words separated by non-word characters, except single quotes

I'm trying to implement a method that parses the input into “word tokens”: sequences of word characters separated by non-word characters. However, non-word characters can become part of a token if they are quoted (in single quotes).
I want to use a regex, but I'm having trouble getting my code just right:

public static List<String> wordTokenize(String input) {
    Pattern pattern = Pattern.compile ("\\b(?:(?<=\')[^\']*(?=\')|\\w+)\\b");
    Matcher matcher = pattern.matcher (input);
    List<String> ans = new ArrayList<>();
    while (matcher.find ()){
        ans.add (matcher.group ());
    }
    return ans;
}

My regex fails to recognize that a quote opening in the middle of a word, with no space before it, continues the current token instead of starting a new one. Examples:

The input: this-string 'has only three tokens' // works
The input: "this*string'has only two@tokens'"
Expected :[this, stringhas only two@tokens]
Actual :[this, string, has only two@tokens]
The input: "one'two''three' '' four 'twenty-one'"
Expected :[onetwothree, , four, twenty-one]
Actual :[one, two, three, four, twenty-one]

How do I fix the spaces?

Solution

You want to match one or more occurrences of a word char or a substring between the closest single straight apostrophes, and remove all those apostrophes from the tokens.

Use the following regex and .replace("'", "") on the matches:

(?:\w|'[^']*')+

See the regex demo. Details:

- (?: - start of a non-capturing group
  - \w - a word char
  - | - or
  - ' - a straight single quotation mark
  - [^']* - any 0+ chars other than a straight single quotation mark
  - ' - a straight single quotation mark
- )+ - end of the group, 1+ occurrences.
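
To make the breakdown above concrete, here is a minimal sketch that prints each raw match (quotes still included) next to the token left after the quotes are stripped. The class name RawMatchDemo and the helper rawMatches are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question or answer.
class RawMatchDemo {
    // Returns each match exactly as the regex finds it, quotes included.
    static List<String> rawMatches(String input) {
        Matcher matcher = Pattern.compile("(?:\\w|'[^']*')+").matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "one'two''three' '' four 'twenty-one'";
        for (String m : rawMatches(s)) {
            // Show the raw match next to the token left after stripping quotes.
            System.out.println("[" + m + "] -> [" + m.replace("'", "") + "]");
        }
    }
}
```

For this input the regex finds four raw matches: one'two''three', then '', then four, then 'twenty-one'. The second one is why an empty token appears in the result: the group matches a quoted segment of zero characters, and stripping the quotes leaves an empty string.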

See the Java demo:

// String s = "this*string'has only two@tokens'"; // => [this, stringhas only two@tokens]
String s = "one'two''three' '' four 'twenty-one'"; // => [onetwothree, , four, twenty-one]
Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = pattern.matcher(s);
List<String> tokens = new ArrayList<>();
while (matcher.find()){
    tokens.add(matcher.group(0).replace("'", "")); 
}

Note that Pattern.UNICODE_CHARACTER_CLASS is added so that \w matches all Unicode letters and digits, not only the ASCII ones.
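
As a rough sketch of how this could be wrapped into the wordTokenize method from the question, and of what the flag changes, the class below compiles the same regex twice, once with and once without Pattern.UNICODE_CHARACTER_CLASS. The names WordTokenizer, TOKEN_REGEX, ASCII, UNICODE and tokenize are assumptions made for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only wordTokenize comes from the question.
class WordTokenizer {
    // A run of word chars and/or '...' quoted segments.
    private static final String TOKEN_REGEX = "(?:\\w|'[^']*')+";

    // Without the flag, \w only covers [a-zA-Z_0-9].
    static final Pattern ASCII = Pattern.compile(TOKEN_REGEX);
    // With the flag, \w also covers non-ASCII letters and digits.
    static final Pattern UNICODE =
            Pattern.compile(TOKEN_REGEX, Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> wordTokenize(String input) {
        return tokenize(UNICODE, input);
    }

    // Collects every match and removes the quote characters from it.
    static List<String> tokenize(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group().replace("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokenize("this*string'has only two@tokens'"));
        // "caf\u00e9" is "cafe" with an acute accent on the final letter.
        String accented = "caf\u00e9 au lait";
        System.out.println(tokenize(ASCII, accented));   // accented letter is dropped
        System.out.println(tokenize(UNICODE, accented)); // accented letter is kept
    }
}
```

One limitation worth knowing about: a quote with no closing partner is not matched at all, so an input such as it's yields the two tokens it and s. Whether that is acceptable depends on how the method is expected to treat unbalanced quotes, which the question does not specify.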