I need to write a lexer for a Java source code plagiarism detector. Here is an example of what I want to achieve.
//Java code                                  Tokens:
public class Count {                         Begin Class
  public static void main(String[] args)     Var Def, Begin Method
      throws java.io.IOException {
    int count = 0;                           Var Def, Assign
    while (System.in.read() != -1)           Apply, Begin While
      count++;                               Assign, End While
    System.out.println(count+" chars.");     Apply
  }                                          End Method
}                                            End Class
I think JFlex is the right tool to generate the lexer. However, after looking through some examples, I cannot find a way to distinguish class brackets and method brackets. Most tokenizers I find just recognize them as the same token. Also, how do I distinguish a method apply from a variable identifier?
I cannot find a way to distinguish class brackets and method brackets.
There is nothing lexically different about them: "{".equals("{"). The way you distinguish them is by context in the parser. The lexer can't make that distinction, nor should it.
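To illustrate what "by context" can mean, here is a minimal sketch of a pass that runs over tokens the lexer has already produced and labels each brace from what came before it. The class name BraceLabeler, the method labelBraces, and the labelling rules (a brace after `class` opens a class, a brace directly inside a class opens a method) are my own assumptions for illustration, not part of JFlex or any real parser, and they ignore plenty of Java (initializer blocks, lambdas, nested classes, and so on).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: labels braces in an already-lexed token list by context.
class BraceLabeler {
    static List<String> labelBraces(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // kinds of the currently open braces
        List<String> labels = new ArrayList<>();
        String pending = null;                     // construct waiting for its "{"
        for (String t : tokens) {
            switch (t) {
                case "class": pending = "Class"; break;
                case "while": pending = "While"; break;
                case ";":     pending = null;    break; // e.g. a brace-less while body
                case "{":
                    // Assumption: an unannounced "{" directly inside a class is a method body.
                    String kind = pending != null ? pending
                            : "Class".equals(open.peek()) ? "Method" : "Block";
                    open.push(kind);
                    labels.add("Begin " + kind);
                    pending = null;
                    break;
                case "}":
                    labels.add("End " + (open.isEmpty() ? "Block" : open.pop()));
                    break;
                default:
                    break;
            }
        }
        return labels;
    }
}
```

The point of the sketch is that the lexer hands over the same "{" token every time; only the surrounding tokens (and the stack of what is currently open) tell the two apart. A real parser generated from a grammar does this bookkeeping for you.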
Also how do I distinguish a method apply from a variable identifier?
In the lexer, you don't. An identifier is an identifier. The token stream generated from "f(x)" should be Identifier, OpeningParenthesis, Identifier, ClosingParenthesis.
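As a rough illustration of that token stream, here is a tiny hand-written lexer in plain Java rather than a JFlex specification. The names TinyLexer and tokenKinds are made up for this example, and it only knows identifiers and parentheses (keywords are lumped in with identifiers, everything else becomes "Other"), so treat it as a sketch and not as what JFlex would generate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal lexer: returns only the kind of each token, in order.
class TinyLexer {
    static List<String> tokenKinds(String src) {
        List<String> kinds = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // skip whitespace
            } else if (Character.isJavaIdentifierStart(c)) {
                // Consume the whole identifier; its role (call, variable) is not decided here.
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) {
                    i++;
                }
                kinds.add("Identifier");
            } else if (c == '(') {
                kinds.add("OpeningParenthesis");
                i++;
            } else if (c == ')') {
                kinds.add("ClosingParenthesis");
                i++;
            } else {
                kinds.add("Other");                    // anything this sketch does not model
                i++;
            }
        }
        return kinds;
    }
}
```

Calling TinyLexer.tokenKinds("f(x)") yields Identifier, OpeningParenthesis, Identifier, ClosingParenthesis. Both f and x come out as the same kind of token, which is exactly the behaviour described above.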
Now in the parser you'll recognize a function name by the fact that it's followed by an opening parenthesis, but again that's the parser's job, not the lexer's.
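A crude version of that parser-side check, assuming the tokens are already available as strings, could look like the following. ApplyFinder, calledNames, and the keyword list are hypothetical and deliberately incomplete. In particular this sketch does not tell a call like read() apart from a declaration like main(String[] args); a real parser would use the grammar for that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical post-lexing pass: an identifier directly followed by "(" is treated as an apply.
class ApplyFinder {
    // Keywords that may precede "(" without being a call; not an exhaustive list.
    private static final Set<String> KEYWORDS = new HashSet<>(
            Arrays.asList("if", "while", "for", "switch", "catch", "synchronized", "return"));

    static List<String> calledNames(List<String> tokens) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String t = tokens.get(i);
            boolean isIdentifier = !t.isEmpty()
                    && Character.isJavaIdentifierStart(t.charAt(0))
                    && !KEYWORDS.contains(t);
            // One token of lookahead is the only context this check uses.
            if (isIdentifier && "(".equals(tokens.get(i + 1))) {
                names.add(t);
            }
        }
        return names;
    }
}
```

For the tokens of `while (System.in.read() != -1) count++;` this reports only read: `while` is filtered out as a keyword, and System, in, and count are not followed by an opening parenthesis.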