Filtering duplicate substrings from a hash in Ruby


I'm writing a Rails app that fetches RSS feeds from news pages, applies part-of-speech tagging to each title, extracts the noun phrases from the titles, and counts the number of times each occurs. I need to filter out the noun phrases that are part of other noun phrases, and am using this code to do so:

filtered_noun_phrases = sorted_noun_phrases.select{|a|
  sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h

So this:

{"troops retake main government office"=>2,
 "retake main government office"=>2, "main government office"=>2}

should become just:

{"troops retake main government office"=>2}

However, a sorted hash of noun phrases such as this:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
 "retake main government office"=>2, "mosul retake government base"=>2,
 "toddler killer shot dead"=>2, "students fighting racism"=>2,
 "retake government base"=>2, "main government office"=>2,
 "white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2,
 "cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2,
 "boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2,
 "silver surfers"=>2, "house tourists"=>2, "natural causes"=>2,
 "george michael"=>2, "instagram fame"=>2, "hacking tools"=>2,
 "iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2,
 "haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2}

is instead only partially filtered:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
 "retake main government office"=>2, "mosul retake government base"=>2,
 "toddler killer shot dead"=>2, "students fighting racism"=>2,
 "retake government base"=>2, "main government office"=>2,
 "white house tourists"=>2, "horn at french zoo"=>2,
 "cia hacking tools"=>2, "killer shot dead"=>2,
 "boko haram teen"=>2}

So how can I filter duplicate substrings out of a hash in a way that actually works?


Solution

  • Your current code selects every phrase that contains some other phrase as a substring.

    For "troops retake main government office" this is true, because it contains "retake main government office", so it is kept.

    However, "retake main government office" in turn contains "main government office", so it is kept as well instead of being filtered out.

    Instead, you can turn the check around, for instance:

     filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h

    This rejects every phrase that is included in some other phrase, so only phrases that are not part of a longer phrase remain.
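
To see the difference between the two checks side by side, here is a minimal, self-contained sketch using a handful of phrases from the question. The names `noun_phrases`, `keys` and `contains_other` are only illustrative, and I've used `include?` where the snippets above use `index`; for a plain substring test the two should behave the same.

```ruby
# A few phrases taken from the question's hash (all counts are 2).
noun_phrases = {
  "troops retake main government office" => 2,
  "retake main government office"        => 2,
  "main government office"               => 2,
  "government office"                    => 2,
  "mosul battle"                         => 2
}

keys = noun_phrases.keys

# The question's check: keeps every phrase that CONTAINS another phrase.
contains_other = noun_phrases.select do |phrase, _count|
  keys.any? { |other| other != phrase && phrase.include?(other) }
end

# The reversed check: drops every phrase that IS CONTAINED in another phrase.
filtered_noun_phrases = noun_phrases.reject do |phrase, _count|
  keys.any? { |other| other != phrase && other.include?(phrase) }
end

p contains_other.keys
# => ["troops retake main government office", "retake main government office", "main government office"]

p filtered_noun_phrases.keys
# => ["troops retake main government office", "mosul battle"]
```

Note that the first version also throws away standalone phrases such as "mosul battle", because they do not contain any other phrase. Both versions compare every key against every other key, which should be fine for a hash of this size.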