How similar are two words?


I would like to measure the similarity between two words. The similarity should be computed by a C++ function that returns a float between 0 and 1. If the two words are very similar, the result should be near 1; if they are very different, it should be near 0. For instance, "Analyse" and "Analise" might return 0.95, and "Substracting" and "describe" might return something near 0. How can I do that in C++?

Attempt:

float similarity(const std::string& word1, const std::string& word2) {
    const std::size_t len1 = word1.size();
    const std::size_t len2 = word2.size();
    float score = 0;
    // Count position-wise matches, each weighted by 1/len1.
    for (std::size_t i = 0; i < std::min(len1, len2); i++) {
        score += static_cast<float>(word1[i] == word2[i]) / len1;
    }
    return score;
}

Is this fine? Can I do better? I don't need machine learning here. This is just for testing purposes, but it can't be too crude either. The attempt above is OK, but it is not good enough.


Solution

  • Have a look at Levenshtein Distance and Levenshtein Distance Implementation

    You can normalize the distance computed by that algorithm to get the score you need: divide it by the length of the longer word and subtract the result from 1.

    Later edit:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>
    
    unsigned int edit_distance(const std::string& s1, const std::string& s2) {
        const std::size_t len1 = s1.size(), len2 = s2.size();
        std::vector<std::vector<unsigned int>> d(len1 + 1, std::vector<unsigned int>(len2 + 1));
    
        d[0][0] = 0;
        for(unsigned int i = 1; i <= len1; ++i) d[i][0] = i;
        for(unsigned int i = 1; i <= len2; ++i) d[0][i] = i;
    
        for(unsigned int i = 1; i <= len1; ++i)
            for(unsigned int j = 1; j <= len2; ++j)
                d[i][j] = std::min(std::min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));
        return d[len1][len2];
    }
    
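    float similarity(const std::string& s1, const std::string& s2) {
        const std::size_t longest = std::max(s1.size(), s2.size());
        if (longest == 0) return 1;  // two empty strings: avoid 0/0 and treat them as identical
        return 1 - 1.0 * edit_distance(s1, s2) / longest;
    }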
    
    int main() {
        std::vector<std::pair<std::string, std::string>> words = {
            { "Julius", "Iulius" },
            { "Frank", "Blank" },
            { "George", "Dog" },
            { "Cat", "Elephant" },
            { "Cucumber", "Tomato" }
        };
        for (const auto& word_pair : words) {
            std::cout << "Similarity between [" << word_pair.first << "] & ["
                      << word_pair.second << "]: " << similarity(word_pair.first, word_pair.second)
                      << std::endl;
        }
        return 0;
    }
    

    and the output:

    Similarity between [Julius] & [Iulius]: 0.833333
    Similarity between [Frank] & [Blank]: 0.6
    Similarity between [George] & [Dog]: 0.333333
    Similarity between [Cat] & [Elephant]: 0.25
    Similarity between [Cucumber] & [Tomato]: 0.125