
How to compare many sentences across several documents in Java


I have several sentences in two documents that I compare with each other. I use a similarity formula to compare them, and I use a List<List<>> to hold the sentences from each document. But my code only works for 2 documents; it does not work when I compare more documents, for example 5 or more.

The problem is how to get the sentences from several documents so that I can compare them all.

Here is my code.

List<List<Sentence>> collect = Arrays.asList(new File(p).listFiles()).stream()
            .map((x) -> configSentenceByLine(x.getAbsolutePath()))
            .map((x) -> tokenizingWord(x))
            .map((x) -> stemmingWord(x))
            .map((x) -> countWordBased(x))
            .collect(Collectors.toList());

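for (int i = 0; i < collect.get(0).size(); i++) {
    int mr = 1;
    for (int j = 0; j < collect.get(1).size(); j++) {
        // compare sentence i of the first document with sentence j of the second
        // (index j, not j + 1, so the first sentence is not skipped and the last index stays in range)
        double sim = nc.getSimilarity(collect.get(0).get(i).getSentence(), collect.get(1).get(j).getSentence());
        System.out.println("Similarity = " + sim);
        mr++;
    }
}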

Sorry for my bad English


Solution

  • I suppose you need to compute the similarity of all lines between all N documents. If so, you have to compare every possible pair of documents. The total number of document pairs is the number of combinations of N documents taken 2 at a time without repetition, N(N-1)/2; thus, for 5 documents there are 5·4/2 = 10 possible pairs.

    The actual pairs are: 1-2, 1-3, 1-4, 1-5, 2-3, 2-4, 2-5, 3-4, 3-5, 4-5

    As you may notice, you first compare the 1st document with the remaining 4, then the 2nd with the remaining 3, and so on.

    //for each document, except for the last one
    for (int k = 0; k < collect.size() - 1; k++) {
        //for each line i in the current document k
        for (int i = 0; i < collect.get(k).size(); i++) {
            //for each document m after k
            for (int m = k + 1; m < collect.size(); m++) {
                //for each line j in document m
                for (int j = 0; j < collect.get(m).size(); j++) {
                    //do your stuff by comparing
                    //collect.get(k).get(i).getSentence()
                    //WITH
                    //collect.get(m).get(j).getSentence()
                }
            }
        }
    }
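
To try the pairwise loop without the original classes, here is a minimal, self-contained sketch. It makes several assumptions: sentences are plain String values instead of the asker's Sentence objects, and the hypothetical similarity method (a simple Jaccard word overlap) only stands in for nc.getSimilarity, whose real formula is not shown in the question. The names PairwiseSimilarity, countDocumentPairs and compareAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PairwiseSimilarity {

    // Number of unordered document pairs: n * (n - 1) / 2.
    static int countDocumentPairs(int n) {
        return n * (n - 1) / 2;
    }

    // Hypothetical stand-in for nc.getSimilarity(...):
    // Jaccard overlap of the lower-cased, whitespace-separated words.
    static double similarity(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        wordsA.retainAll(wordsB); // wordsA now holds the intersection
        return union.isEmpty() ? 0.0 : (double) wordsA.size() / union.size();
    }

    // Compares every sentence of document k with every sentence
    // of each later document m, so each document pair is visited once.
    static List<String> compareAll(List<List<String>> docs) {
        List<String> results = new ArrayList<>();
        for (int k = 0; k < docs.size() - 1; k++) {
            for (int m = k + 1; m < docs.size(); m++) {
                for (int i = 0; i < docs.get(k).size(); i++) {
                    for (int j = 0; j < docs.get(m).size(); j++) {
                        double sim = similarity(docs.get(k).get(i), docs.get(m).get(j));
                        results.add("doc " + (k + 1) + " line " + (i + 1)
                                + " vs doc " + (m + 1) + " line " + (j + 1)
                                + ": similarity = " + sim);
                    }
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Three tiny sample documents (made-up data for the demo).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the cat sat", "dogs bark loudly"),
                Arrays.asList("the cat ran"),
                Arrays.asList("birds sing", "the dog sat"));

        System.out.println("Document pairs: " + countDocumentPairs(docs.size()));
        for (String line : compareAll(docs)) {
            System.out.println(line);
        }
    }
}
```

This sketch groups the work by document pair (k, m) before iterating over lines, whereas the loops above iterate over the lines of document k first; both orders perform the same set of comparisons, only the output order differs.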