
How do I find the percentage of similarity between two multiline Strings?


I have two multi-line strings. I'm using the following code to determine the similarity between them. It is based on the Levenshtein distance algorithm.

  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { 
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }

    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

But the above code does not work as expected.

For instance, say we have the following two strings, s1 and s2:

S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?

S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?

When I pass these two strings to the similarity method, it does not give the exact percentage of difference. How do I optimize the algorithm?

Update: the following is the method that calls similarity:

public static boolean authQuestion(String question) throws SQLException{


        boolean isQuestionAvailable = false;
        Connection dbCon = null;
        try {
            dbCon = MyResource.getConnection();
            String query = "SELECT * FROM WORDBANK where WORD ~*  ?;";
            PreparedStatement checkStmt = dbCon.prepareStatement(query);
            checkStmt.setString(1, question);
            ResultSet rs = checkStmt.executeQuery();
            while (rs.next()) {
                double re=similarity( rs.getString("question"), question);
                if(re  > 0.6){
                    isQuestionAvailable = true;
                }else {
                    isQuestionAvailable = false;
                }
            }
        } catch (URISyntaxException e1) {
            e1.printStackTrace();
        } catch (SQLException sqle) {
            sqle.printStackTrace();
        } catch (Exception e) {
            if (dbCon != null)
                dbCon.close();
        } finally {
            if (dbCon != null)
                dbCon.close();
        }

        return isQuestionAvailable;
    }

Solution

  • I can suggest an approach.

    You are using edit distance, which gives you the number of characters in S1 that you need to change, add, or remove in order to turn it into S2.

    So, for example:

    S1 = "abc"
    S2 = "cde"
    

    The edit distance is 3 and the strings are 100% different (assuming a character-by-character comparison).
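To check the figure for this pair, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance. The class name `EditDistanceDemo` is made up for illustration, and unlike the `editDistance` in the question this version is case-sensitive.

```java
final class EditDistanceDemo {

    // Full-table Levenshtein distance: d[i][j] is the number of edits needed
    // to turn the first i characters of a into the first j characters of b.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete i characters
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert j characters
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = d[i - 1][j - 1] + cost;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitution, Math.min(deletion, insertion));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Three substitutions: a->c, b->d, c->e
        System.out.println(editDistance("abc", "cde")); // prints 3
    }
}
```

Since both strings have length 3, an edit distance of 3 means every position had to change.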

    So you can get an approximate percentage of difference (using floating-point division) if you do

    S1 = "abc"
    S2 = "cde"
    edit = edit_distance(S1, S2)
    percentage = min(edit/S1.length(), edit/S2.length())
    

    The min is a workaround for cases where the strings differ greatly in length, for example:

    S1 = "abc"
    S2 = "defghijklmno"
    

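    Here the edit distance (12) is bigger than the length of S1 (3), so dividing by the length of S1 alone would give more than 100%. Taking the min of the two ratios is the same as dividing by the length of the longer string, which keeps the result between 0% and 100%.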
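The normalization described above can be sketched in Java as follows. This is an illustration under assumptions, not a definitive implementation: the class name `SimilarityDemo` and the method name `difference` are hypothetical, and the distance function is a case-sensitive two-row Levenshtein rather than the one from the question. Note that the `similarity` method in the question already divides by the longer length, so it follows the same idea.

```java
final class SimilarityDemo {

    // Two-row Levenshtein distance: only the previous and current rows are kept.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Fraction of the longer string that has to be edited:
    // 0.0 means identical, 1.0 means completely different.
    static double difference(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // both strings are empty
        }
        return editDistance(a, b) / (double) longer; // cast avoids integer division
    }

    static double similarity(String a, String b) {
        return 1.0 - difference(a, b);
    }

    public static void main(String[] args) {
        System.out.println(difference("abc", "cde"));          // prints 1.0
        System.out.println(difference("abc", "defghijklmno")); // prints 1.0
    }
}
```

Because the edit distance can never exceed the length of the longer string, `difference` always stays in the range 0.0 to 1.0.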

    Hope that helps.