I have the following method I'm trying to implement: parses the input into “word tokens”: sequences of word characters separated by non-word characters. However, non-word characters can become part of a token if they are quoted (in single quotes).
I want to use regex but have trouble getting my code just right:
public static List<String> wordTokenize(String input) {
Pattern pattern = Pattern.compile ("\\b(?:(?<=\')[^\']*(?=\')|\\w+)\\b");
Matcher matcher = pattern.matcher (input);
ArrayList ans = new ArrayList();
while (matcher.find ()){
ans.add (matcher.group ());
}
return ans;
}
My regex fails to identify that starting a word mid word without space doesn't mean starting a new word. Examples:
The input: this-string 'has only three tokens' // works
The input:
"this*string'has only two@tokens'"
Expected :[this, stringhas only two@tokens]
Actual :[this, string, has only two@tokens]
The input: "one'two''three' '' four 'twenty-one'"
Expected :[onetwothree, , four, twenty-one]
Actual :[one, two, three, four, twenty-one]
How do I fix the spaces?
You want to match one or more occurrences of a word char or a substring between the closest single straight apostrophes, and remove all those apostrophes from the tokens.
Use the following regex and .replace("'", "")
on the matches:
(?:\w|'[^']*')+
See the regex demo. Details:
(?:
- start of a non-capturing group
\w
- a word char|
- or'
- a straight single quotation mark[^']*
- any 0+ chars other than a straight single quotation mark'
- a straight single quotation mark)+
- end of the group, 1+ occurrences.See the Java demo:
// String s = "this*string'has only two@tokens'"; // => [this, stringhas only two@tokens]
String s = "one'two''three' '' four 'twenty-one'"; // => [onetwothree, , four, twenty-one]
Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = pattern.matcher(s);
List<String> tokens = new ArrayList<>();
while (matcher.find()){
tokens.add(matcher.group(0).replace("'", ""));
}
Note the Pattern.UNICODE_CHARACTER_CLASS
is added for the \w
pattern to match all Unicode letters and digits.