Search code examples
c#stringalgorithmsimilarity

String similarity with varying lengths but same words


I know there are many string similarity algorithms but I don't know which would be best for my problem.

My strings vary in length but usually have some extra fluff added to one or the other I want algorithm to give high similarity "points" when strings contain same words without typos. For example Stuff and Things corp. is same as Stuff and Things corporation or 101, Stuff and Things corporat or Stuff and Things.

But strings color and colour, Loremipsum and Olremipsum in my case are totally different. My strings would never have characters which are mistyped or swapped also strings are 1 to 50 chars long.

EDIT: Order of same words is very importat New York city would be different or have low level of similarity with York New city

Thanks for any help


Solution

  • Ok, even if the rules still aren't that clear i'll give it a try.

    To summary your requirement:

    • Find the longest common word-sequence in another sentence
    • At least two words must be common, so New York and New Delhi are not equal
    • the order matters, so New York city and York New city are not equal

    The method FindCommonWords returns a sequence of words that are common in both sentences or an empty sequence(Enumerable.Empty<string>) if no common word sequence was found.

    It first splits both strings by a pre-defined list of word separators into two string[]. Then it checks all "sub-sequences" whether or not it is contained in the other array in the same order(with an extension method IndexOfSequence).

    private static readonly char[] wordSeparators = { '\n', '\t', ',', '.', '!', '?', ';', ':', ' ', '-', '/', '\\', '[', ']', '(', ')', '<', '>', '@', '"', '\'' };
    
    public static IEnumerable<string> FindCommonWords(string str1, string str2, StringComparer comparer = null)
    {
        if (str1 == null)
            throw new ArgumentNullException("str1", "Both input strings must not be null!");
        if (str2 == null)
            throw new ArgumentNullException("str2", "Both input strings must not be null!");
    
        if (comparer == null) comparer = StringComparer.CurrentCulture;
        str1 = str1.Trim();
        str2 = str2.Trim();
    
        string[] words1 = str1.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
        string[] words2 = str2.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
        if(Math.Min(words1.Length, words2.Length) < 2)
            return Enumerable.Empty<string>(); // one word is not supposed to be a commnon word sequence
    
        // use for-loop to find the longest common words
        for (int wordCount = words1.Length - 1; wordCount >= 2; wordCount--)
        {
            // scan word-count from left to right
            for (int skipCount = 0; wordCount + skipCount <= words1.Length; skipCount++)
            {
                // take wordCount-words from left side and walk from left to right
                IEnumerable<string> wordSeq = words1.Skip(skipCount).Take(wordCount);
                // search sequence in other words
                int indexInWords2 = words2.IndexOfSequence(wordSeq, comparer);
                if (indexInWords2 >= 0)
                {
                    // found match in other words, must be longest common sequence
                    return wordSeq;
                }
            }
        }
        return Enumerable.Empty<string>();
    }
    

    Here's the extension which might also be useful for other requirements:

    public static int IndexOfSequence<TSource>(this IEnumerable<TSource> input, IEnumerable<TSource> sequence, IEqualityComparer<TSource> comparer)
    {
        if (input == null) throw new ArgumentNullException("input");
        if (sequence == null) throw new ArgumentNullException("sequence");
        if (!sequence.Any()) throw new ArgumentException("Sequence must not be empty", "sequence");
        if (comparer == null)
        {
            comparer = EqualityComparer<TSource>.Default;
        }
        int index = -1, firstIndex = -1, lastFoundIndex = -1;
        bool found = false;
    
        using (IEnumerator<TSource> enumerator = input.GetEnumerator())
        {
            using (IEnumerator<TSource> enumerator2 = sequence.GetEnumerator())
            {
                enumerator2.MoveNext();
                while (enumerator.MoveNext())
                {
                    index++;
                    found = comparer.Equals(enumerator.Current, enumerator2.Current);
                    if (found && firstIndex == -1)
                        firstIndex = index;
                    else if (found && index != lastFoundIndex + 1)
                        found = false; // sequence must be consecutive
                    if (found && !enumerator2.MoveNext())
                        return firstIndex;
                    if(found)
                        lastFoundIndex = index;
                }
            }
        }
        return -1;
    }
    

    Here are your three samples:

    var commonWords = FindCommonWords(
         "Stuff and Things corporation", 
         "101, Stuff and Things corporat", 
         StringComparer.CurrentCultureIgnoreCase);
    Console.WriteLine(string.Join(" ", commonWords));   // Stuff and Things
    
    commonWords = FindCommonWords(
         "101, Stuff and Things corporat",
         "or Stuff and Things.",
         StringComparer.CurrentCultureIgnoreCase);
    Console.WriteLine( string.Join(" ", commonWords) ); // Stuff and Things
    
    commonWords = FindCommonWords(
         "New York city",
         "York New city",
         StringComparer.CurrentCultureIgnoreCase);
    Console.WriteLine(string.Join(" ", commonWords));  // empty sequence, no match
    

    Note that it's written from the scratch and not tested thoroughly.