
Using Ruby to tag records that contain repeat phrases in a table


I'm trying to use Ruby to 'tag' records in a CSV table based on whether a particular field contains one of a set of recurring phrases. I'm not sure whether there are libraries that help with this kind of job, and I recognize that Ruby might not be the most efficient language for it.

My CSV table contains a unique ID and a text field that I want to search:

ID,NOTES
1,MISSING DOB; ID CANNOT BE BLANK
2,INVALID MEMBER ID - unable to verify
3,needs follow-up
4,ID CANNOT BE BLANK-- additional info needed

From this CSV table, I've extracted keywords and assigned them a tag, which I've stored in another CSV table.

PHRASE,TAG
MISSING DOB,BLANKDOB
ID CANNOT BE BLANK,BLANKID
INVALID MEMBER ID,INVALIDID

Note that the NOTES column in my source contains punctuation and other phrases in addition to the phrases I have identified and want to map. Additionally, not all records have phrases that will match.

I want to create a table that looks something like this:

ID, TAG
1, BLANKDOB
1, BLANKID
2, INVALIDID
4, BLANKID

Or, alternately with the tags delimited with another character:

ID, TAG
1, BLANKDOB; BLANKID
2, INVALIDID
4, BLANKID

I have loaded the mapping table into a hash, with the phrase as the key.

require 'csv'

phrase_hash = {}
CSV.foreach("phrase_lookup.csv") do |row|
    phrase, tag = row
    next if phrase == "PHRASE"  # skip the header row
    phrase_hash[phrase] = tag
end

The keys of the hash are then the search phrases that I want to iterate through. I'm having trouble expressing what I want to do next in Ruby, but here's the idea:

Load the NOTES table into an array. For each phrase (i.e. key), select the records from the array that contain the phrase, gather the IDs associated with these rows, and output them with the associated tag for that phrase, as above.

Can anyone help?


Solution

  • I'll give you an example using hash inputs instead of CSV:

    notes = { 1 => "MISSING DOB; ID CANNOT BE BLANK",
              2 => "INVALID MEMBER ID - unable to verify",
              3 => "needs follow-up",
              4 => "ID CANNOT BE BLANK-- additional info needed"
            }
    
    tags =  { "MISSING DOB" => "BLANKDOB",
              "ID CANNOT BE BLANK" => "BLANKID",
              "INVALID MEMBER ID" => "INVALIDID"
            }
    
    output = {}
    
    tags.each_pair do |tags_key, tags_value|
        notes.each_pair do |notes_key, notes_value|
            # include? is a literal substring test; match would treat the phrase as a regex
            if notes_value.include?(tags_key)
                output[notes_key] ||= []
                output[notes_key] << tags_value
            end
        end
    end
    
    # Sort by the numeric ID before formatting, so that ID 10 does not land ahead of ID 2
    puts output.sort.map {|k,v| "#{k}, #{v.join("; ")}"}
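
To connect the hash-based example back to the CSV inputs in the question, here is a minimal sketch that parses both tables with Ruby's standard `csv` library and builds both of the output layouts the question asks for. The inline strings stand in for the real files, and the variable names (`phrase_to_tag`, `pairs`, `grouped`, `long_csv`) are made up for illustration. It assumes the headers are exactly `ID`, `NOTES`, `PHRASE` and `TAG`, and that matching is a case-sensitive literal substring test.

```ruby
require "csv"

# Inline stand-ins for the two CSV files from the question (assumption:
# in practice these would be read from disk, e.g. with CSV.foreach).
notes_csv = <<~EOS
  ID,NOTES
  1,MISSING DOB; ID CANNOT BE BLANK
  2,INVALID MEMBER ID - unable to verify
  3,needs follow-up
  4,ID CANNOT BE BLANK-- additional info needed
EOS

phrases_csv = <<~EOS
  PHRASE,TAG
  MISSING DOB,BLANKDOB
  ID CANNOT BE BLANK,BLANKID
  INVALID MEMBER ID,INVALIDID
EOS

# Build the phrase => tag lookup; headers: true consumes the header row.
phrase_to_tag = {}
CSV.parse(phrases_csv, headers: true).each do |row|
  phrase_to_tag[row["PHRASE"]] = row["TAG"]
end

# "Long" layout: one [id, tag] pair per phrase found in a note.
pairs = []
CSV.parse(notes_csv, headers: true).each do |row|
  phrase_to_tag.each do |phrase, tag|
    # to_s guards against an empty NOTES cell, which CSV parses as nil.
    pairs << [row["ID"], tag] if row["NOTES"].to_s.include?(phrase)
  end
end

# "Delimited" layout: one entry per ID, with its tags joined by "; ".
grouped = pairs.group_by(&:first)
               .transform_values { |id_tags| id_tags.map(&:last).join("; ") }

# Serialize the long layout back to CSV text.
long_csv = CSV.generate do |csv|
  csv << %w[ID TAG]
  pairs.each { |pair| csv << pair }
end

puts long_csv
grouped.each { |id, tag_list| puts "#{id}, #{tag_list}" }
```

IDs stay as strings here because CSV fields are parsed as text; convert them with `Integer(row["ID"])` if numeric ordering matters. To write to a file instead of a string, `CSV.open(path, "w")` accepts the same `csv << row` calls.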