What would be the best way to compare a pattern with a set of strings, one by one, while rating how closely the pattern matches each string? In my limited experience with regex, matching is a pretty binary operation: no matter how complicated the pattern is, in the end it either matches or it doesn't. I am looking for something beyond a plain match/no-match answer. Is there a good technique or algorithm for this?
Here's an example:
Let's say I have a pattern foo bar
and I want to find the string that most closely matches it out of the following strings:
foo for
foo bax
foo buo
fxx bar
Now, none of these actually match the pattern, but which non-match is the closest to being a match? In this case, foo bax
would be the best choice, since it matches 6 out of the 7 characters.
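The "6 out of the 7 characters" count above can be reproduced with a position-by-position comparison. This is only a minimal sketch of that counting idea: the class and method names (PositionalMatch, matchingPositions) are made up for illustration, and it assumes characters are compared at the same index, so it does not handle inserted or deleted characters.

```java
public class PositionalMatch {
    // Hypothetical helper: counts the positions at which both strings
    // hold the same character. Only the overlapping part is compared,
    // so extra trailing characters in either string never count as matches.
    static int matchingPositions(String pattern, String candidate) {
        int limit = Math.min(pattern.length(), candidate.length());
        int matches = 0;
        for (int i = 0; i < limit; i++) {
            if (pattern.charAt(i) == candidate.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String pattern = "foo bar";
        String[] candidates = {"foo for", "foo bax", "foo buo", "fxx bar"};
        for (String candidate : candidates) {
            System.out.println(candidate + ": " + matchingPositions(pattern, candidate)
                    + " of " + pattern.length());
        }
    }
}
```

With the strings above, foo bax scores 6 of 7 and the other three score 5 of 7.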
Apologies if this is a duplicate question; I didn't really know what to search for when I checked whether it had already been asked.
This one works. I checked it against the Wikipedia example: the Levenshtein distance between "kitten" and "sitting" is 3.
import java.util.ArrayList;
import java.util.List;

public class LevenshteinDistance {
    public static final String TEST_STRING = "foo bar";

    public static void main(String... args) {
        LevenshteinDistance test = new LevenshteinDistance();
        List<String> testList = new ArrayList<String>();
        testList.add("foo for");
        testList.add("foo bax");
        testList.add("foo buo");
        testList.add("fxx bar");
        for (String string : testList) {
            System.out.println("Levenshtein Distance for " + string + " is " + test.getLevenshteinDistance(TEST_STRING, string));
        }
    }

    public int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        int n = s.length(); // length of s
        int m = t.length(); // length of t
        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }
        int[] p = new int[n + 1]; // 'previous' cost array, horizontally
        int[] d = new int[n + 1]; // cost array, horizontally
        int[] _d; // placeholder to assist in swapping p and d
        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t
        char t_j; // jth character of t
        int cost; // 0 if the two characters are equal, 1 otherwise
        for (i = 0; i <= n; i++) {
            p[i] = i;
        }
        for (j = 1; j <= m; j++) {
            t_j = t.charAt(j - 1);
            d[0] = j;
            for (i = 1; i <= n; i++) {
                cost = s.charAt(i - 1) == t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }
            // swap the current distance counts into the 'previous row' for the next iteration
            _d = p;
            p = d;
            d = _d;
        }
        // our last action in the above loop was to switch d and p, so p now
        // actually has the most recent cost counts
        return p[n];
    }
}
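To go from raw distances to "which string is closest, and by how much", one option is to keep the candidate with the smallest distance and turn the distance into a score between 0 and 1. The sketch below is self-contained, so it carries its own compact distance function. The names ClosestMatch, similarity and closest are hypothetical, and dividing by the length of the longer string is just one assumed way to normalize.

```java
public class ClosestMatch {
    // Two-row Levenshtein distance: insert, delete and substitute each cost 1.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of t takes j inserts
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning the first i chars of s into "" takes i deletes
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String s, String t) {
        int longest = Math.max(s.length(), t.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(s, t) / longest;
    }

    // Returns the candidate with the smallest distance (first one wins on ties),
    // or null if there are no candidates.
    static String closest(String pattern, String[] candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = distance(pattern, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] candidates = {"foo for", "foo bax", "foo buo", "fxx bar"};
        String best = closest("foo bar", candidates);
        System.out.println(best + " (similarity " + similarity("foo bar", best) + ")");
    }
}
```

For the strings in the question this picks foo bax, which is one substitution away from foo bar; the other three are each two edits away.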