I am working on an implementation of a source code plagiarism algorithm(winnowing algorithm) and have a problem where I need some help.
Example: I have a string
String test="blahello,,,,/blatestbla7234///§"§$%"%$\n\n23344)§()(§$blablayeahbla";
and transform this String to
test="blahelloblatestblablablayeahbla"
and from this string I build kgrams for example 5-grams
blahe lahel ahell hello ellob llobl .... ahbla
I save the kgrams in a list of strings but would also like to save the start and end position from the original text of every kgram, so I can reference in the end every kgram back to their original text position.
EDIT:
So my question would be how can I get the start and end position of a kgram Can anyone help me there? Do you have any idea? Thanks in advance.
If you want the positions from the original string, you can't remove the non-letters first, or the information is lost. You'll either need to find the kgrams in the original string directly (more CPU time) or store the original position of each letter along with the modified string (more memory space).
Here's an implementation of the latter:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class KGram {
public final String str;
public final int start;
public final int end;
public KGram(String str, int start, int end) {
this.str = str;
this.start = start;
this.end = end;
}
@Override
public String toString() {
return "KGram[\"" + str + "\":" + start + "," + end + "]";
}
public static List<KGram> extractFrom(String input, int size) {
char[] chars = new char[input.length()];
int[] indexes = new int[input.length()];
int len = 0;
for (int i = 0; i < input.length(); i++) {
char c = input.charAt(i);
if (!Character.isLetter(c)) continue;
chars[len] = c;
indexes[len] = i;
len++;
}
List<KGram> kgrams = new ArrayList<>();
for (int i = 0, j = size - 1; j < len; i++, j++) {
String str = new String(Arrays.copyOfRange(chars, i, j + 1));
kgrams.add(new KGram(str, indexes[i], indexes[j]));
}
return kgrams;
}
}
Example:
String test = "blahello,,,,/blatestbla7234///§\"§$%\"%$\n\n23344)§()(§$blablayeahbla";
List<KGram> kgrams = KGram.extractFrom(test, 5);
System.out.println(kgrams.get(4)); // prints KGram["ellob":4,13]
System.out.println(kgrams.get(26)); // prints KGram["ahbla":60,64]