ruby-on-rails, ruby, nlp, similarity

Grouping bulk text into groups based on a given similarity percentage


I went through the NLP gems available on GitHub but was not able to find the right solution.

Is there any gem or library available for grouping text based on a given similarity percentage? All of those gems help to find the similarity between two strings, but grouping a bulk array of data takes a lot of time to complete.


Solution

  • You can do it by using just Ruby plus one of the listed gems.

    I chose fuzzy-string-match because I liked the name.

    Here's how you use the gem:

    require 'fuzzystringmatch'
    
    # Create the matcher
    jarow = FuzzyStringMatch::JaroWinkler.create( :native )
    
    # Get the distance
    jarow.getDistance(  "jones",      "johnson" )
    # => 0.8323809523809523
    
    # Round it
    jarow.getDistance(  "jones",      "johnson" ).round(2)
    # => 0.83
    

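    Despite the method name, getDistance returns a similarity score between 0.0 and 1.0, where 1.0 means the strings are identical. Since the score is a float, you can set the precision you're looking for with the round method.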

    Now, to group similar results, you can use the group_by method found in the Enumerable module.

    You pass it a block, and group_by iterates over the collection. For each element, the block returns the value you want to group by (in this case, the rounded score), and group_by returns a hash with the scores as keys and arrays of the strings that share each score as values.

    require 'fuzzystringmatch'
    
    jarow = FuzzyStringMatch::JaroWinkler.create( :native )
    
    target = "jones"
    precision = 2
    candidates = [ "Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen" ]
    
    distances = candidates.group_by { |candidate|
      jarow.getDistance( target, candidate ).round(precision)
    }
    
    distances
    # => {0.52=>["Jessica Jones"],
    #     0.87=>["Jones"],
    #     0.68=>["Johnson"],
    #     0.55=>["thompson", "thompsen"],
    #     0.83=>["john"]}
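
The example above groups candidates by their score against a single target. If the goal is instead to group strings with each other once they reach a given similarity percentage, one possible approach is a greedy single-pass grouping, sketched below. Everything in it is an assumption for illustration: `group_by_similarity` and `bigram_similarity` are hypothetical helper names, and the scorer is a plain-Ruby stand-in (Dice coefficient over character bigrams) so that the sketch runs without any gem. Because the scorer is passed as a block, it could be replaced with `{ |a, b| jarow.getDistance(a, b) }`, though the resulting groups would then differ.

```ruby
# Hypothetical stand-in scorer: Dice coefficient over unique character
# bigrams, case-insensitive. Returns a Float between 0.0 and 1.0.
def bigram_similarity(a, b)
  a = a.downcase
  b = b.downcase
  return 1.0 if a == b

  pairs_a = a.chars.each_cons(2).map(&:join).uniq
  pairs_b = b.chars.each_cons(2).map(&:join).uniq
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  2.0 * (pairs_a & pairs_b).size / (pairs_a.size + pairs_b.size)
end

# Hypothetical helper: greedy single-pass grouping. Each string joins the
# first group whose first member scores at or above the threshold;
# otherwise it starts a new group. The scorer is supplied as a block.
def group_by_similarity(strings, min_percent, &similarity)
  threshold = min_percent / 100.0
  strings.each_with_object([]) do |string, groups|
    match = groups.find { |group| similarity.call(group.first, string) >= threshold }
    if match
      match << string
    else
      groups << [string]
    end
  end
end

candidates = ["Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen"]

groups = group_by_similarity(candidates, 60) { |a, b| bigram_similarity(a, b) }

p groups
# => [["Jessica Jones"], ["Jones"], ["Johnson", "john"], ["thompson", "thompsen"]]
```

Each string is compared only with one representative per group rather than with every other string, which may help with large arrays. The trade-off is that the result depends on input order and on which string happens to become a group's first member.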
    

    I hope this helps