I went through the NLP gems listed on GitHub NLP but was not able to find the right solution.
Is there any gem or library available for grouping text based on a given similarity percentage? All of the above gems help find the similarity between two strings, but grouping a large array of data takes a long time to complete.
You can do it by using just Ruby plus one of the listed gems.
I chose fuzzy-string-match because I liked the name.
Here's how you use the gem:
require 'fuzzystringmatch'
# Create the matcher
jarow = FuzzyStringMatch::JaroWinkler.create( :native )
# Get the distance
jarow.getDistance( "jones", "johnson" )
# => 0.8323809523809523
# Round it
jarow.getDistance( "jones", "johnson" ).round(2)
# => 0.83
Since you're getting a float, you can define the precision you're looking for using the round method.
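As a small illustration of how the precision argument changes the result, here is the same score rounded to different numbers of decimal places. The variable name score is only for this example; the value is the "jones" vs "johnson" result shown above.

```ruby
# The Jaro-Winkler score for "jones" vs "johnson" from the example above.
score = 0.8323809523809523

score.round(1) # => 0.8
score.round(2) # => 0.83
score.round(3) # => 0.832
```

A lower precision puts more strings into the same bucket once you start grouping by the rounded score; a higher precision splits them into finer buckets.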
Now, to group similar results, you can use the group_by method found in the Enumerable module.
You pass it a block and group_by will iterate over the collection. On each iteration, the block returns the value you're grouping by (in this case, the rounded distance), and group_by returns a hash with the distances as keys and arrays of the strings that share that distance as values.
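To see group_by on its own, without the gem, here is a minimal sketch that groups words by their length. The word list and variable names are made up for the illustration.

```ruby
words = ["ant", "bee", "crow", "dove", "eagle"]

# The block's return value becomes the hash key; every element that
# produced the same key ends up in the same array.
by_length = words.group_by { |word| word.length }
# => {3=>["ant", "bee"], 4=>["crow", "dove"], 5=>["eagle"]}
```

Swapping the block for one that returns a rounded similarity score gives you groups keyed by score instead of by length.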
require 'fuzzystringmatch'
jarow = FuzzyStringMatch::JaroWinkler.create( :native )
target = "jones"
precision = 2
candidates = [ "Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen" ]
distances = candidates.group_by { |candidate|
  jarow.getDistance( target, candidate ).round(precision)
}
distances
# => {0.52=>["Jessica Jones"],
# 0.87=>["Jones"],
# 0.68=>["Johnson"],
# 0.55=>["thompson", "thompsen"],
# 0.83=>["john"]}
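The hash above groups candidates by their score against a single target. If what you actually want is to cluster an array so that strings above a given similarity percentage end up together, one possible approach is a greedy pass like the sketch below. Everything in it is an assumption for illustration: group_similar is a hypothetical helper, not part of any gem, and bigram_similarity is a stand-in score (a Dice coefficient over character bigrams, not Jaro-Winkler) included only so the snippet runs without installing anything.

```ruby
# Stand-in similarity (NOT Jaro-Winkler): Dice coefficient over the unique,
# case-insensitive character bigrams of each string. Returns 0.0..1.0.
def bigram_similarity(a, b)
  pairs_a = a.downcase.each_char.each_cons(2).to_a.uniq
  pairs_b = b.downcase.each_char.each_cons(2).to_a.uniq
  return 0.0 if pairs_a.empty? || pairs_b.empty?

  shared = (pairs_a & pairs_b).size
  (2.0 * shared) / (pairs_a.size + pairs_b.size)
end

# Hypothetical helper: each string joins the first group whose first member
# scores at or above the threshold; otherwise it starts a new group.
def group_similar(strings, threshold, similarity)
  strings.each_with_object([]) do |string, groups|
    match = groups.find { |group| similarity.call(group.first, string) >= threshold }
    if match
      match << string
    else
      groups << [string]
    end
  end
end

names  = ["thompson", "thompsen", "Jones", "jones", "Johnson"]
groups = group_similar(names, 0.7, method(:bigram_similarity))
# => [["thompson", "thompsen"], ["Jones", "jones"], ["Johnson"]]
```

To use the gem instead, you could pass a lambda that calls jarow.getDistance(a, b) as the third argument. Each string is compared only against the first member of each existing group, which is usually far fewer comparisons than scoring every pair, but the result depends on the input order. Note also that the stand-in downcases its inputs, which is why "Jones" and "jones" land together here; the Jaro-Winkler scores above are case-sensitive, which is why "Jones" scores 0.87 rather than 1.0 against "jones".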
I hope this helps