Grouping bulk text into groups based on a given similarity percentage

I went through the NLP gems listed under GitHub NLP but was not able to find the right solution.

Is there any gem or library available for grouping text based on a given similarity percentage? All of these gems help find the similarity between two strings, but grouping a bulk array of data takes a long time to complete.

Solution

You can do it by using just Ruby plus one of the listed gems.

I chose fuzzy-string-match because I liked the name.

Here's how you use the gem:

require 'fuzzystringmatch'

# Create the matcher
jarow = FuzzyStringMatch::JaroWinkler.create( :native )

# Get the distance (a similarity score; 1.0 means the strings are identical)
jarow.getDistance( "jones", "johnson" )
# => 0.8323809523809523

# Round it
jarow.getDistance( "jones", "johnson" ).round(2)
# => 0.83

Since getDistance returns a Float between 0 and 1 (higher means more similar), you can set the precision you're looking for with the round method.
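As a small illustration of how the precision changes the result, here is the score from above rounded to different numbers of digits. This uses only plain Ruby and the value already shown, so it runs without the gem; fewer digits means more strings will end up sharing the same rounded score.

```ruby
# The score returned above for "jones" vs "johnson"
score = 0.8323809523809523

# Fewer digits give coarser buckets, more digits give finer ones
coarse = score.round(1)
medium = score.round(2)
fine   = score.round(3)

puts coarse  # 0.8
puts medium  # 0.83
puts fine    # 0.832
```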

Now, to group similar results, you can use the group_by method from the Enumerable module.

You pass it a block, and group_by iterates over the collection. For each element, the block returns the value you want to group by (in this case, the distance), and group_by returns a hash with the distances as keys and arrays of the strings that matched together as values.

require 'fuzzystringmatch'

jarow = FuzzyStringMatch::JaroWinkler.create( :native )

target = "jones"
precision = 2
candidates = [ "Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen" ]

distances = candidates.group_by { |candidate|
  jarow.getDistance( target, candidate ).round(precision)
}

distances
# => {0.52=>["Jessica Jones"],
#     0.87=>["Jones"],
#     0.68=>["Johnson"],
#     0.55=>["thompson", "thompsen"],
#     0.83=>["john"]}
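The example above groups everything relative to a single target. If you instead want to split the whole array into groups whose members are at least a given percentage similar, one possible approach is a single greedy pass: compare each string with the first member of every existing group and start a new group when nothing reaches the threshold. The sketch below is only that, a sketch. group_by_threshold and bigram_similarity are hypothetical helpers written for this answer, not part of fuzzy-string-match, and the bigram (Dice) score is a stand-in so the snippet runs without the gem. With the gem installed you could pass `{ |a, b| jarow.getDistance(a, b) }` as the block instead, and you would need to pick a threshold that suits that score.

```ruby
# Stand-in similarity (an assumption, not the gem's algorithm):
# Dice coefficient over the sets of character bigrams, in 0.0..1.0.
def bigram_similarity(a, b)
  pairs_a = a.downcase.each_char.each_cons(2).map(&:join).uniq
  pairs_b = b.downcase.each_char.each_cons(2).map(&:join).uniq
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  2.0 * (pairs_a & pairs_b).size / (pairs_a.size + pairs_b.size)
end

# Hypothetical helper: greedy grouping by a similarity threshold.
# Each string joins the first group whose first member is similar
# enough; otherwise it starts a new group.
def group_by_threshold(strings, threshold, &similarity)
  groups = []
  strings.each do |string|
    group = groups.find { |g| similarity.call(g.first, string) >= threshold }
    if group
      group << string
    else
      groups << [string]
    end
  end
  groups
end

candidates = ["Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen"]

groups = group_by_threshold(candidates, 0.5) { |a, b| bigram_similarity(a, b) }

groups
# => [["Jessica Jones", "Jones"], ["Johnson", "john"], ["thompson", "thompsen"]]
```

Because each string is only compared with one representative per group, this does far fewer comparisons than scoring every pair, but the result depends on the order of the input and on which string happens to become the representative.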

I hope this helps.