Search code examples
javastringnlp

API to get edits that transform string A to string B in java


What is the best API to get the edits required to transform string A to string B

A: This is a sentence
B: This was a sentences

Something like this
Edits:

  1. { start: 5, end: 7, action: "replace" text: "was" }
  2. { start: 18, action: "insert", text: "s" }

I know that I can do this on my own using Levenshtein distance algo, but I want to go with an industry-standard If someone knows about any NLP/String Util or API that I can use in java, please help me out.

Spacey errant is an option, but that is for python. I am looking for something similar to that in java.


Solution

  • Option 1

    Use Diff Match Patch library. It offers the following features:

    1. Diff: Compare two blocks of plain text and efficiently return a list of differences
    2. Match: Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.
    3. Patch: Apply a list of patches onto plain text. Use best-effort to apply patch even when the underlying text doesn't match.

    Example based on this:

    import java.util.LinkedList;
    import name.fraser.neil.plaintext.diff_match_patch;
    
    public class Test {
      public static void main(String args[]) {
        diff_match_patch dmp = new diff_match_patch();
        LinkedList<diff_match_patch.Diff> diff = dmp.diff_main("This is a sentence", "This was a sentences");
        dmp.diff_cleanupSemantic(diff);
        System.out.println(diff);
      }
    }
    

    Output:

    [Diff(EQUAL,"This "), Diff(DELETE,"i"), Diff(INSERT,"wa"), Diff(EQUAL,"s a sentence"), Diff(INSERT,"s")]
    

    Option 2

    Use Diff Utils library, which is an actively maintained fork of java-diff-utils from Google Code Archive. Diff Utils allows comparison operations between texts: computing diffs, applying patches, etc.

    Example (for more examples have a look here):

    import com.github.difflib.text.*;
    import java.util.*;
    
    public class Main {
    
        public static void main(String[] args) {
            // create a configured DiffRowGenerator
            DiffRowGenerator generator = DiffRowGenerator.create().showInlineDiffs(true).mergeOriginalRevised(true)
                    .inlineDiffByWord(true).oldTag(f -> "~") // introduce markdown style for strikethrough
                    .newTag(f -> "**") // introduce markdown style for bold
                    .build();
    
            // compute the differences for two test texts.
            List<DiffRow> rows = generator.generateDiffRows(Arrays.asList("This is a sentence"),
                    Arrays.asList("This was a sentences"));
    
            System.out.println(rows.get(0).getOldLine());
        }
    }
    

    Output:

    This ~is~**was** a ~sentence~**sentences**
    

    Option 3

    Create your own Diff utility, such as the one described here. The code below is as given in the above source, and uses Longest Common Subsequence (LCS) - which is a variation of Edit distance (Levenshtein distance) - to solve this problem.

    Example:

    public class Main {
        // Function to display the differences between two strings
        public static void diff(String X, String Y, int m, int n, int[][] lookup) {
            // if the last character of `X` and `Y` matches
            if (m > 0 && n > 0 && X.charAt(m - 1) == Y.charAt(n - 1)) {
                diff(X, Y, m - 1, n - 1, lookup);
                System.out.print(" " + X.charAt(m - 1));
            }
    
            // if the current character of `Y` is not present in `X`
            else if (n > 0 && (m == 0 || lookup[m][n - 1] >= lookup[m - 1][n])) {
                diff(X, Y, m, n - 1, lookup);
                System.out.print(" +" + Y.charAt(n - 1));
            }
    
            // if the current character of `X` is not present in `Y`
            else if (m > 0 && (n == 0 || lookup[m][n - 1] < lookup[m - 1][n])) {
                diff(X, Y, m - 1, n, lookup);
                System.out.print(" -" + X.charAt(m - 1));
            }
        }
    
        // Function to fill the lookup table by finding the length of LCS
        // of substring X[0…m-1] and Y[0…n-1]
        public static int[][] findLCS(String X, String Y, int m, int n) {
            // lookup[i][j] stores the length of LCS of substring X[0…i-1] and Y[0…j-1]
            int[][] lookup = new int[X.length() + 1][Y.length() + 1];
    
            // first column of the lookup table will be all 0
            for (int i = 0; i <= m; i++) {
                lookup[i][0] = 0;
            }
    
            // first row of the lookup table will be all 0
            for (int j = 0; j <= n; j++) {
                lookup[0][j] = 0;
            }
    
            // fill the lookup table in a bottom-up manner
            for (int i = 1; i <= m; i++) {
                for (int j = 1; j <= n; j++) {
                    // if current character of `X` and `Y` matches
                    if (X.charAt(i - 1) == Y.charAt(j - 1)) {
                        lookup[i][j] = lookup[i - 1][j - 1] + 1;
                    }
                    // otherwise, if the current character of `X` and `Y` don't match
                    else {
                        lookup[i][j] = Integer.max(lookup[i - 1][j], lookup[i][j - 1]);
                    }
                }
            }
    
            return lookup;
        }
    
        // Implement diff utility in Java
        public static void main(String[] args) {
            String X = "This is a sentence";
            String Y = "This was a sentences";
    
            // lookup[i][j] stores the length of LCS of substring X[0…i-1] and Y[0…j-1]
            int[][] lookup = findLCS(X, Y, X.length(), Y.length());
    
            // find the difference
            diff(X, Y, X.length(), Y.length(), lookup);
        }
    }
    

    Output:

     T h i s   -i +w +a s   a   s e n t e n c e +s