Search code examples
c#searchlevenshtein-distance

Implementing Levenstein distance for reversed string combination?


I have an employees list in my application. Every employee has name and surname, so I have a list of elements like:

["Jim Carry", "Uma Turman", "Bill Gates", "John Skeet"]

I want my customers to have a feature to search employees by names with a fuzzy-searching algorithm. For example, if user enters "Yuma Turmon", the closest element - "Uma Turman" will return. I use a Levenshtein distance algorithm, I found here.

static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

I iterate user's input (full name) over the list of employee names and compare distance. If it is below 3, for example, I return found employee.

Now I want allow users to search by reversed names - for example, if user inputs "Turmon Uma" it will return "Uma Turman", as actually real distance is 1, because First name and Last name is the same as Last name and First name. My algorithm now counts it as different strings, far away. How can I modify it so that names are found regardless of order?


Solution

  • A few thoughts, as this is a potentially complicated problem to get right:

    1. Split each employee name into a list of strings. Personally, I'd probably discard anything with 2 or fewer letters, unless that's all the name is composed of. This should help with surnames like "De La Cruz" which might get searched as "dela cruz". Store the list of names for each employee in a dictionary that points back to that employee.
    2. Split the search terms in the same way you split the names in the list. For each search term find the names with the lowest Levenshtein distance, then for each one, starting at the lowest, repeat the search with the rest of the search terms against the other names for that employee. Repeat this step with each word in the query. For example, if the query is John Smith, find the best single word name matches for John, then match remaining names for those "best match" employees on Smith, and get a sum of the distances. Then find the best matches for Smith and match remaining names on John, and sum the distances. The best match is the one with the lowest total distance. You can provide a list of best matches by returning the top 10, say, sorted by total distance. And it won't matter which way around the names in the database or the search terms are. In fact they could be completely out of order and it wouldn't matter.
    3. Consider how to handle hyphenated names. I'd probably split them as if they were not hyphenated.
    4. Consider upper/lower case characters, if you haven't already. You should store lookups in one case and convert the search terms to the same case before comparison.
    5. Be careful of accented letters, many people have them in their names, such as á. Your algorithm won't work correctly with them. Be even more careful if you expect to ever have non-alpha double byte characters, eg. Chinese, Japanese, Arabic, etc.

    Two more benefits of splitting the names of each employee:

    • "Unused" names won't count against the total, so if I only search using the last name, it won't count against me in finding the shortest distance.
    • Along the same lines, you could apply some extra rules to help with finding non-standard names. For example, hyphenated names could be stored both as hyphenated (eg. Wells-Harvey), compound (WellsHarvey) and individual names (Wells and Harvey separate), all against the same employee. A low-distance match on any one name is a low-distance match on the employee, again extra names don't count against the total.

    Here's some basic code that seems to work, however it only really takes into account points 1, 2 and 4:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    
    namespace EmployeeSearch
    {
        static class Program
        {
            static List<string> EmployeesList = new List<string>() { "Jim Carrey", "Uma Thurman", "Bill Gates", "Jon Skeet" };
            static Dictionary<int, List<string>> employeesById = new Dictionary<int, List<string>>();
            static Dictionary<string, List<int>> employeeIdsByName = new Dictionary<string, List<int>>();
    
            static void Main()
            {
                Init();
                var results = FindEmployeeByNameFuzzy("Umaa Thurrmin");
                // Returns:
                // (1) Uma Thurman  Distance: 3
                // (0) Jim Carrey  Distance: 10
                // (3) Jon Skeet  Distance: 11
                // (2) Bill Gates  Distance: 12
                Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name}  Distance: {r.Distance}")));
    
                var results = FindEmployeeByNameFuzzy("Tormin Oma");
                // Returns:
                // (1) Uma Thurman  Distance: 4
                // (3) Jon Skeet  Distance: 7
                // (0) Jim Carrey  Distance: 8
                // (2) Bill Gates  Distance: 9
                Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name}  Distance: {r.Distance}")));
    
    
                Console.Read();
            }
    
            private static void Init() // prepare our lists
            {
                for (int i = 0; i < EmployeesList.Count; i++)
                {
                    // Preparing the list of names for each employee - add special cases such as hyphenation here as well
                    var names = EmployeesList[i].ToLower().Split(new char[] { ' ' }).ToList();
                    employeesById.Add(i, names);
    
                    // This is not used here, but could come in handy if you want a unique index of names pointing to employee ids for optimisation:
                    foreach (var name in names)
                    {
                        if (employeeIdsByName.ContainsKey(name))
                        {
                            employeeIdsByName[name].Add(i);
                        }
                        else
                        {
                            employeeIdsByName.Add(name, new List<int>() { i });
                        }
                    }
                }
    
            }
    
            private static List<SearchResult> FindEmployeeByNameFuzzy(string query)
            {
                var results = new List<SearchResult>();
    
                // Notice we're splitting the search terms the same way as we split the employee names above (could be refactored out into a helper method)
                var searchterms = query.ToLower().Split(new char[] { ' ' });
    
                // Comparison with each employee
                for (int i = 0; i < employeesById.Count; i++)
                {
                    var r = new SearchResult() { Id = i, Name = EmployeesList[i] };
                    var employeenames = employeesById[i];
                    foreach (var searchterm in searchterms)
                    {
                        int min = searchterm.Length;
    
                        // for each search term get the min distance for all names for this employee
                        foreach (var name in employeenames)
                        {
                            var distance = LevenshteinDistance.Compute(searchterm, name);
                            min = Math.Min(min, distance);
                        }
    
                        // Sum the minimums for all search terms
                        r.Distance += min;
                    }
                    results.Add(r);
                }
    
                // Order by lowest distance first
                return results.OrderBy(e => e.Distance).ToList();
            }
        }
    
        public class SearchResult
        {
            public int Distance { get; set; }
            public int Id { get; set; }
            public string Name { get; set; }
        }
    
        public static class LevenshteinDistance
        {
            /// <summary>
            /// Compute the distance between two strings.
            /// </summary>
            public static int Compute(string s, string t)
            {
                int n = s.Length;
                int m = t.Length;
                int[,] d = new int[n + 1, m + 1];
    
                // Step 1
                if (n == 0)
                {
                    return m;
                }
    
                if (m == 0)
                {
                    return n;
                }
    
                // Step 2
                for (int i = 0; i <= n; d[i, 0] = i++)
                {
                }
    
                for (int j = 0; j <= m; d[0, j] = j++)
                {
                }
    
                // Step 3
                for (int i = 1; i <= n; i++)
                {
                    //Step 4
                    for (int j = 1; j <= m; j++)
                    {
                        // Step 5
                        int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
    
                        // Step 6
                        d[i, j] = Math.Min(
                            Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                            d[i - 1, j - 1] + cost);
                    }
                }
                // Step 7
                return d[n, m];
            }
        }
    }
    

    Simply call Init() when you start, then call

    var results = FindEmployeeByNameFuzzy(userquery);
    

    to return an ordered list of the best matches.

    Disclaimers: This code is not optimal and has only been briefly tested, doesn't check for nulls, could explode and kill a kitten, etc, etc. If you have a large number of employees then this could be very slow. There are several improvements that could be made, for example when looping over the Levenshtein algorithm you could drop out if the distance gets above the current minimum distance.