ruby-on-rails regex text-parsing

Parse a call transcript into an array of hashes - Ruby


I am parsing a call transcript. The content of the transcript comes back as a string formatted like the following:

"Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."

There is no newline where a new speaker starts speaking.

I would like to turn the above string into an array of hashes, something like the following:

[ { speaker: "Operator",
    content: "Hi, please welcome Bob Smith to the call" },
  { speaker: "Bob Smith",
    content: "Hello there, thank you for inviting me...Now I will turn the call over to Stacy." }, 
  { speaker: "Stacy White",
    content: "Thanks Bob. As he was saying...." }
]

I think I need some sort of regular expression to parse this, but I have no idea how to write it, even after spending the morning reading up on the subject. Any help here would be much appreciated.

Thanks

Update:

To others that may find this useful, here's what I ended up coming up with using the suggested solution below:

def display_transcript
  # Split on speaker labels; the capture group keeps each name in the result,
  # and [1..-1] drops the empty string that precedes the first label.
  parts = content.split(/\W*([A-Z]\w*\W*\w+):\W*/)[1..-1]
  # Pair each speaker name with the text that follows it.
  parts.each_slice(2).map do |speaker, text|
    { speaker: speaker, content: text }
  end
end
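
For anyone who wants to try this outside a model, here is a minimal standalone sketch of the same idea. The method name parse_transcript and the constant SPEAKER_LABEL are made up for illustration, and it assumes every speaker label starts with a capital letter, as in my transcripts. It uses drop(1) instead of [1..-1] so that an empty string returns an empty array rather than raising.

```ruby
# Hypothetical helper; parse_transcript and SPEAKER_LABEL are illustrative
# names, not part of the original code.
SPEAKER_LABEL = /\W*([A-Z]\w*\W*\w+):\W*/

def parse_transcript(text)
  # drop(1) discards whatever precedes the first label (an empty string when
  # the transcript starts with a speaker) and yields [] for empty input.
  parts = text.split(SPEAKER_LABEL).drop(1)
  parts.each_slice(2).map { |speaker, content| { speaker: speaker, content: content } }
end

sample = "Operator: Hi, please welcome Bob Smith to the call. " \
         "Bob Smith: Hello there. Stacy White: Thanks Bob."

parse_transcript(sample).each do |row|
  puts "#{row[:speaker]} said: #{row[:content]}"
end
```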

Solution

  • I can give you an expression you can use to break up the string. From there I'm sure you can take it on yourself; you wouldn't want me to take away the pleasure of reaching your goal, would you? :>)

    string = "Operator: Hi, please welcome Bob Smith to the call. Bob Smith: Hello there, thank you for inviting me...Now I will turn the call over to Stacy. Stacy White: Thanks Bob. As he was saying...."
    split_up = string.split(/\W*(\w*\W*\w+):\W*/)[1..-1]
    Hash[*split_up]
    # {"Operator"=>"Hi, please welcome Bob Smith to the call", "Bob Smith"=>"Hello there, thank you for inviting me...Now I will turn the call over to Stacy", "Stacy White"=>"Thanks Bob. As he was saying...."}
    

    Some explanation: the regular expression looks for one or two words (\w*\W*\w+), optionally preceded by a period and a space (\W*), and followed by a colon and optional spaces (:\W*). This expression is used to split the string into an array; because the words sit inside a capture group, split keeps them in the result. Since the string starts with a speaker name, the result begins with an empty string, which you get rid of with [1..-1]. Next you convert that Array into a Hash: the first element is a key, the second its value, and so on until the end of the Array.
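
    One thing to watch with the Hash conversion: keys are unique, so a speaker who talks more than once keeps only their last turn. A small sketch of the difference, using a made-up transcript in which Bob Smith speaks twice (the sample string and variable names are assumptions for illustration only):

```ruby
# Made-up sample in which one speaker has two turns.
string = "Bob Smith: Hello there everyone. Stacy White: Thanks for joining. " \
         "Bob Smith: Back to you."

# Alternating list: speaker, content, speaker, content, ...
split_up = string.split(/\W*(\w*\W*\w+):\W*/)[1..-1]

# Hash keys are unique, so the second "Bob Smith" turn overwrites the first.
as_hash = Hash[*split_up]

# Pairing the elements two at a time keeps every turn, in order.
as_rows = split_up.each_slice(2).map do |speaker, content|
  { speaker: speaker, content: content }
end

puts "turns in hash:  #{as_hash.size}"
puts "turns in array: #{as_rows.length}"
```

    The sample deliberately uses two-word names. Because the first word in the pattern is optional and \W* can span a period, a one-word name in the middle of a transcript appears to pull in the last word of the preceding sentence, so it is worth testing against your real data.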