I have a custom object called customer with fields such as Customer_Name, Address_Line_1, Post_Code etc.
I'd like to run through all records and compare Customer_Name for likeness (based on fuzzy search or levenshtein distance). If the likeness is above or below a certain threshold, a custom field (Possible_Duplicate_Customer_ID__c) will be updated to identify the possible duplicate.
I've managed to implement this but I'm experiencing 2 issues:
1). exceeding Salesforce govenor limits (too many script statements: 200001) likely caused by the heavy looping required by the Levenshtein distance algorithm. 2). Also the list (newList) I am committing contains duplicate ID's.
private static List<Customer__c> newList = new List<Customer__c>();
webService static Integer findDupes() {
Integer returnCount = 0;
Double cost = 0;
Integer COST_THRESHOLD = 5;
Map<id,Customer__c> cMap = new Map<id,Customer__c>([
select ID, Name, Customer_Name__c, Possible_Duplicate_Customer_ID__c
from Customer__c
]);
List<Customer__c> custList1 = cMap.values();
List<Customer__c> custList2 = custList1.clone();
for (Customer__c cust1 :custList1) {
for (Customer__c cust2 :custList2) {
cost = LevenshteinDistance.computeLevenshteinDistance(
cust1.Customer_Name__c, cust2.Customer_Name__c);
if(cost<COST_THRESHOLD && cost != 0) {
Customer__c c = new Customer__c(
id = cust2.Id,
Possible_Duplicate_Customer_ID__c = cust1.Name
);
newList.add(c);
}
System.debug(cost+' edits to transform '
+cust1.Customer_Name__c+' to '+cust2.Customer_Name__c);
}
}
returnCount = newList.size();
update newList;
return returnCount;
}
I would suggest running the code inside a class which uses the batchable interface, this is much better suited to processing large volumes of data. Since your web service doesn't take an input to work with, you could run the batch hourly on a schedule, flag dupes by marking the records and then extract those in the webservice. Of course, if you need it to be in realtime you'll need to optimise this loop instead.
As for the duplicate ids in the update list, your use of cust2.Id
for the updates should account for this, but you don't seem to protecting against the case where a customer record is compared with itself! This should fix it up:
for (Customer__c cust1 :custList1) {
for (Customer__c cust2 :custList2) {
if (cust1.Id == cust2.Id) {
continue;
}