Search code examples
c#stringinvoice

c#: How to find which string from a list of strings has the most matching characters in a row to a sample string?


I am attempting to match line items from an invoice to corresponding line items from a contract using a description of the item. Here is an example of an invoice item description (after I have removed unnecessary characters):

"145HBK25G"

and here is an example of a list of contract line items (also after removing unnecessary characters):

"PU125TECHNOAES145HBK2",
"PU1212TECHNOAESHS210ZHEA",
"PU1219TECHNOAESHS200A5A23",
"PU129TECHNOAESBM5614A3A01"

In this example, I would want the first item in the contract list to be selected, because it has the most matching characters in a row ("145HBK2") from the invoice string, even though it does not contain the entire string.

Currently, the method being used to determine which contract line item has the most matching characters to the invoice line item is the following code block, where "ili" is the invoice line item description, and "cli" is the contract line item description:

int maxCharactersMatchedGlobal = 0;

contractLineItemDescriptionsCleansed.ForEach(cli => 
{
     int maxCharactersMatched = Enumerable.Range(0, Math.Min(ili.Length, cli.Length)).Count(i => ili[i] == cli[i]);

     if (maxCharactersMatched > maxCharactersMatchedGlobal)
     {
          maxCharactersMatchedGlobal = maxCharactersMatched;
     }
});

However, this will only work if the contract line item description begins with the characters that are being compared to the invoice line item description. I would like to improve this code so that it checks everywhere in each of the contract line item descriptions for the most characters matching the invoice line item description, even if the cli does not contain the entire ili.


Solution

  • Enumerable.Intersect, as @stuartd had mentioned, is a close option, but I assume you want the matched portion to be a contiguous string (i.e. HKG24S should not match on input HKS because the of the S).

    In this case, I think you're going to have to build your own string matcher.

    Disclaimer: The following recommendation is not very performant

    I'd recommend breaking the problem into parts. First, you can enumerate all the substrings of your search term (pick a min length, probably 3, though if you want to match on a single character, go ahead):

    Given 145HBK25G, you could write a method to produce:

    145
    145H
    145HB
    145HBK2
    145HBK25
    145HBK25G
    45H
    45HB
    ... etc all the way to 
    25G
    

    From there you can iterate over your invoice line items and use string.Contains() to check if each of your substrings is in the line item text. Keeping track of what the longest string matched was. Then, the line item (or items) with the longest matches are your winners.

    Not very performant, but it'll get the job done.