I'm writing a lexer and have chosen to use regular expressions to split the input into tokens.
I've got most of the token types working, but the one that really bugs me is words and identifiers.
You see, the rules I have in place are the following:
Example of what I want:
_foo <- Invalid.
foo_ <- Invalid.
_foo_ <- Invalid.
foo_foo <- Valid.
foo_foo_foo <- Valid.
foo_foo_ <- Partially Valid. Only "foo_foo" should be picked up.
_foo_foo <- Partially Valid. Only "foo_foo" should be picked up.
I'm getting close, as this is what I currently have:
([a-zA-Z]+_[a-zA-Z]+|[a-zA-Z]+)
However, it only handles the first occurrence of an underscore. I want it to handle all of them.
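To illustrate the problem, here is a minimal sketch that runs the pattern above with find(). The class name FirstUnderscoreDemo and the helper matches are made up for this illustration only; they are not part of my tokeniser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstUnderscoreDemo {
    // The pattern from the question, unchanged.
    static final Pattern CURRENT =
            Pattern.compile("([a-zA-Z]+_[a-zA-Z]+|[a-zA-Z]+)");

    // Hypothetical helper: collects every group-1 match that find() reports.
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = CURRENT.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("foo_foo"));     // [foo_foo]
        System.out.println(matches("foo_foo_foo")); // [foo_foo, foo]
    }
}
```

With two underscores the input is split into two separate matches instead of being picked up as one identifier.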
Personal Request:
I would rather the answer be contained inside a single group, as I have structured my tokeniser around groups. That said, I would be more than happy to change my design if you can think of a better way of handling it. This is what I currently use:
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private void tokenise(String regex, String[] data) {
    Set<String> tokens = new LinkedHashSet<String>();
    Pattern pattern = Pattern.compile(regex);
    // First pass. Uses regular expressions to split data and catalog token types.
    for (String line : data) {
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            // Capturing groups are numbered from 1; group 0 is the whole match.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                if (matcher.group(i) != null) {
                    switch (i) {
                        case 1:
                            // Example group.
                            // Normally I would structure like:
                            // 1: Identifiers
                            // 2: Strings
                            // 3-?: So on so forth.
                            tokens.add("FOO:" + matcher.group(i));
                            break;
                    }
                }
            }
        }
    }
}
Try ([a-zA-Z]+(?:_[a-zA-Z]+)*)
The first part of the pattern, [a-zA-Z]+, matches one or more letters.
The second part of the pattern, (?:_[a-zA-Z]+), matches an underscore if it is followed by one or more letters.
The * at the end means the second part can be repeated zero or more times.
The (?: ) is like plain (), but it is non-capturing: it groups the sub-pattern without returning the matched text as a group.
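As a rough check, the suggested pattern can be run against some of the examples from the question. This is only a sketch: the class name IdentifierPatternDemo and the helper matches are hypothetical, and it assumes the tokeniser keeps calling find() and reading group 1, as in the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdentifierPatternDemo {
    // Letters, optionally followed by any number of "_letters" segments,
    // all inside a single capturing group.
    static final Pattern IDENTIFIER =
            Pattern.compile("([a-zA-Z]+(?:_[a-zA-Z]+)*)");

    // Hypothetical helper: collects every group-1 match that find() reports.
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = IDENTIFIER.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("foo_foo"));     // [foo_foo]
        System.out.println(matches("foo_foo_foo")); // [foo_foo_foo]
        System.out.println(matches("foo_foo_"));    // [foo_foo]
        System.out.println(matches("_foo_foo"));    // [foo_foo]
        System.out.println(matches("_foo"));        // [foo]
    }
}
```

Because find() skips characters it cannot match, a leading or trailing underscore is simply left out of the token. That also means an input such as _foo still yields foo rather than being rejected, so if those inputs must produce no token at all, an extra check would be needed on top of this pattern.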