Search code examples
javastringapache-commonsstring-comparisonlevenshtein-distance

Better compare string method


My question is What's the fastest (quality is important too, but a little less important) way to compare two strings?

I'm looking for the most efficient way to compare two strings. Some of the strings I'm comparing can be over 5000 characters long. I'm comparing a list of about 80 strings with another list of about 200 strings. It takes forever, even when I'm threading it. I'm using the StringUtils.getLevenshteinDistance(String s, String t) method from Apache Commons. My method follows. Is there a better way to do this?

private void compareMe() {
  List<String> compareStrings = MainController.getInstance().getCompareStrings();
  for (String compare : compareStrings) {
    int levenshteinDistance = StringUtils.getLevenshteinDistance(me, compare);
    if (bestScore > levenshteinDistance
          && levenshteinDistance > -1) {
      bestScore = levenshteinDistance; //global variable
      bestString = compare; //global variable
    }
  }
}

Here's a sample of two strings which should have a good score:

String 1:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = '${request.corp_vendor_id};')

String 2:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = 'ACQ-169013')

You'll notice the only difference is the '${request.corp_vendor_id};' at the end of the string. This would cause it to have a score of 26 from the LevenshteinDistance method.


Solution

  • You should think about possible shortcuts in your comparison logic to avoid some calculations at all. So if you want to minimize the Levensthein distance globally, you do not even need to calculate it, if the difference of the string sizes is higher than your current best Levenshtein distance.

    E.g. if your current best Levenshtein distance is 50, then you can avoid comparing two strings of size 100 and 180, because their Levenshtein distance is at least 80.