What is the best API to get the edits required to transform string A to string B
A: This is a sentence
B: This was a sentences
Something like this
Edits:
I know that I can do this on my own using Levenshtein distance algo, but I want to go with an industry-standard If someone knows about any NLP/String Util or API that I can use in java, please help me out.
Spacey errant
is an option, but that is for python. I am looking for something similar to that in java.
Use Diff Match Patch library. It offers the following features:
Example based on this:
import java.util.LinkedList;
import name.fraser.neil.plaintext.diff_match_patch;
public class Test {
public static void main(String args[]) {
diff_match_patch dmp = new diff_match_patch();
LinkedList<diff_match_patch.Diff> diff = dmp.diff_main("This is a sentence", "This was a sentences");
dmp.diff_cleanupSemantic(diff);
System.out.println(diff);
}
}
Output:
[Diff(EQUAL,"This "), Diff(DELETE,"i"), Diff(INSERT,"wa"), Diff(EQUAL,"s a sentence"), Diff(INSERT,"s")]
Use Diff Utils library, which is an actively maintained fork of java-diff-utils from Google Code Archive. Diff Utils allows comparison operations between texts: computing diffs, applying patches, etc.
Example (for more examples have a look here):
import com.github.difflib.text.*;
import java.util.*;
public class Main {
public static void main(String[] args) {
// create a configured DiffRowGenerator
DiffRowGenerator generator = DiffRowGenerator.create().showInlineDiffs(true).mergeOriginalRevised(true)
.inlineDiffByWord(true).oldTag(f -> "~") // introduce markdown style for strikethrough
.newTag(f -> "**") // introduce markdown style for bold
.build();
// compute the differences for two test texts.
List<DiffRow> rows = generator.generateDiffRows(Arrays.asList("This is a sentence"),
Arrays.asList("This was a sentences"));
System.out.println(rows.get(0).getOldLine());
}
}
Output:
This ~is~**was** a ~sentence~**sentences**
Create your own Diff utility, such as the one described here. The code below is as given in the above source, and uses Longest Common Subsequence (LCS) - which is a variation of Edit distance (Levenshtein distance) - to solve this problem.
Example:
public class Main {
// Function to display the differences between two strings
public static void diff(String X, String Y, int m, int n, int[][] lookup) {
// if the last character of `X` and `Y` matches
if (m > 0 && n > 0 && X.charAt(m - 1) == Y.charAt(n - 1)) {
diff(X, Y, m - 1, n - 1, lookup);
System.out.print(" " + X.charAt(m - 1));
}
// if the current character of `Y` is not present in `X`
else if (n > 0 && (m == 0 || lookup[m][n - 1] >= lookup[m - 1][n])) {
diff(X, Y, m, n - 1, lookup);
System.out.print(" +" + Y.charAt(n - 1));
}
// if the current character of `X` is not present in `Y`
else if (m > 0 && (n == 0 || lookup[m][n - 1] < lookup[m - 1][n])) {
diff(X, Y, m - 1, n, lookup);
System.out.print(" -" + X.charAt(m - 1));
}
}
// Function to fill the lookup table by finding the length of LCS
// of substring X[0…m-1] and Y[0…n-1]
public static int[][] findLCS(String X, String Y, int m, int n) {
// lookup[i][j] stores the length of LCS of substring X[0…i-1] and Y[0…j-1]
int[][] lookup = new int[X.length() + 1][Y.length() + 1];
// first column of the lookup table will be all 0
for (int i = 0; i <= m; i++) {
lookup[i][0] = 0;
}
// first row of the lookup table will be all 0
for (int j = 0; j <= n; j++) {
lookup[0][j] = 0;
}
// fill the lookup table in a bottom-up manner
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
// if current character of `X` and `Y` matches
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
lookup[i][j] = lookup[i - 1][j - 1] + 1;
}
// otherwise, if the current character of `X` and `Y` don't match
else {
lookup[i][j] = Integer.max(lookup[i - 1][j], lookup[i][j - 1]);
}
}
}
return lookup;
}
// Implement diff utility in Java
public static void main(String[] args) {
String X = "This is a sentence";
String Y = "This was a sentences";
// lookup[i][j] stores the length of LCS of substring X[0…i-1] and Y[0…j-1]
int[][] lookup = findLCS(X, Y, X.length(), Y.length());
// find the difference
diff(X, Y, X.length(), Y.length(), lookup);
}
}
Output:
T h i s -i +w +a s a s e n t e n c e +s