Search code examples
javastringtransformationplagiarism-detection

Track original position of transformed string in java


I am working on an implementation of a source code plagiarism algorithm(winnowing algorithm) and have a problem where I need some help.

Example: I have a string

String test="blahello,,,,/blatestbla7234///§"§$%"%$\n\n23344)§()(§$blablayeahbla";

and transform this String to

test="blahelloblatestblablablayeahbla"

and from this string I build kgrams for example 5-grams

blahe  lahel  ahell hello  ellob  llobl .... ahbla

I save the kgrams in a list of strings but would also like to save the start and end position from the original text of every kgram, so I can reference in the end every kgram back to their original text position.

EDIT:

So my question would be how can I get the start and end position of a kgram Can anyone help me there? Do you have any idea? Thanks in advance.


Solution

  • If you want the positions from the original string, you can't remove the non-letters first, or the information is lost. You'll either need to find the kgrams in the original string directly (more CPU time) or store the original position of each letter along with the modified string (more memory space).

    Here's an implementation of the latter:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    
    public class KGram {
    
        public final String str;
        public final int start;
        public final int end;
    
        public KGram(String str, int start, int end) {
            this.str = str;
            this.start = start;
            this.end = end;
        }
    
        @Override
        public String toString() {
            return "KGram[\"" + str + "\":" + start + "," + end + "]";
        }
    
        public static List<KGram> extractFrom(String input, int size) {
            char[] chars = new char[input.length()];
            int[] indexes = new int[input.length()];
            int len = 0;
    
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                if (!Character.isLetter(c)) continue;
    
                chars[len] = c;
                indexes[len] = i;
                len++;
            }
    
            List<KGram> kgrams = new ArrayList<>();
            for (int i = 0, j = size - 1; j < len; i++, j++) {
                String str = new String(Arrays.copyOfRange(chars, i, j + 1));
                kgrams.add(new KGram(str, indexes[i], indexes[j]));
            }
            return kgrams;
        }
    }
    

    Example:

    String test = "blahello,,,,/blatestbla7234///§\"§$%\"%$\n\n23344)§()(§$blablayeahbla";
    List<KGram> kgrams = KGram.extractFrom(test, 5);
    
    System.out.println(kgrams.get(4));  // prints KGram["ellob":4,13]
    System.out.println(kgrams.get(26)); // prints KGram["ahbla":60,64]