Tags: ruby-on-rails, ruby, email, email-attachments, mail-gem

How to find the original email in a forward using the Mail gem


How do I use the Mail gem for Ruby to extract the original message HTML content/text content from a forwarded email?

So far, all the examples I've seen deal with extracting content from replies (not forwards), which is much easier because you can just key on the "--reply above this line--" marker in the message.

But in my case, I’m having people forward me confirmation emails, much like how TripIt parses flight itineraries from many different airlines' emails.

The problem is that a forwarded message has a complex hierarchy of “parts”, with parts containing other parts, and I am trying to come up with a reliable way to find the original HTML in the forwarded email's raw source so I can parse it and extract information.

require 'mail'

m = Mail.read('raw.txt')

m.parts
m.parts.first.parts
m.parts.last.parts.first.parts # never ending....
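
To see the whole hierarchy at once instead of drilling into it one level at a time, a short recursive walk can list each part's MIME type indented by depth. This is only a minimal sketch: `part_outline` is a hypothetical helper name, and it assumes nothing more than that each node responds to `multipart?`, `parts`, and `mime_type`, as `Mail::Message` and `Mail::Part` do.

```ruby
# Hypothetical helper (not part of the Mail gem): returns one line per MIME
# part, indented by nesting depth. Works on any object that responds to
# #multipart?, #parts and #mime_type, which Mail::Message and Mail::Part do.
def part_outline(node, depth = 0)
  lines = ["#{'  ' * depth}#{node.mime_type || '(no content type)'}"]
  if node.multipart?
    node.parts.each { |child| lines.concat(part_outline(child, depth + 1)) }
  end
  lines
end

# Usage, assuming the mail gem is loaded and raw.txt holds the raw message:
#   require 'mail'
#   puts part_outline(Mail.read('raw.txt'))
```

The outline makes it easier to spot which branch holds the `text/html` part you are after before writing any extraction logic.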

Solution

  • Here's what I have done in the past: recursively collect every HTML body in the message and take the largest one. This will probably break with multi-level forwards, but in our case it only needs to handle a single level of forwarding, and so far it works great.

    The state of Stack Overflow these days is unfortunate, thanks to stupid votes to close on every single question, including this one, which IMO is legitimate. Do people really expect you to dump 5000 lines of HTML into your question? It's quite obvious what you're asking 🙄

    require 'mail'

    module EmailProcessor
      class Parser
        def initialize(email)
          @email = email
          raise 'must be initialized with type InboundEmail' unless @email.instance_of?(InboundEmail)
        end

        # Returns the largest decoded HTML body in the message, or nil if there is none.
        def execute
          mail = Mail.read_from_string(@email.postmark_raw['RawEmail'])
          find_original_html(mail)
        end

        private

        # Heuristic: the forwarded original is assumed to be the largest HTML body.
        def find_original_html(mail)
          # Start from the message itself so a non-multipart HTML email is handled too.
          bodies = recurse_parts([mail])
          sorted = bodies.sort_by { |b| -b.size }
          puts "PARSED #{sorted.size} BODIES: #{sorted.map(&:size)}"
          sorted.first
        end

        # Walks the MIME tree depth-first and collects every decoded text/html body.
        def recurse_parts(parts)
          bodies = []
          parts.each do |part|
            if part.multipart?
              bodies += recurse_parts(part.parts)
            elsif part.mime_type == 'text/html'
              bodies << part.body.decoded
            end
          end
          bodies
        end
      end
    end
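
One case the walk above does not cover is a forward sent "as attachment", where the original arrives as a `message/rfc822` part rather than as nested multipart content. The following is a hedged sketch of how that could be handled, not something from the original answer: `html_bodies` and `largest_html` are hypothetical names, and it assumes the attached message's raw source is available via `body.decoded` and can be re-parsed with `Mail.new`.

```ruby
# Sketch only: collects decoded text/html bodies, also descending into
# message/rfc822 parts (forwards sent "as attachment"). `parse` turns the raw
# attached message into a Mail-like object; the default assumes the mail gem
# is loaded. `html_bodies` is a hypothetical name, not a Mail gem API.
def html_bodies(node, parse: ->(raw) { Mail.new(raw) })
  if node.multipart?
    node.parts.flat_map { |part| html_bodies(part, parse: parse) }
  elsif node.mime_type == 'message/rfc822'
    # Assumption: the part body is itself a complete email, so parse and recurse.
    html_bodies(parse.call(node.body.decoded), parse: parse)
  elsif node.mime_type == 'text/html'
    [node.body.decoded]
  else
    []
  end
end

# Same "largest HTML body wins" heuristic; returns nil when nothing is found.
def largest_html(mail, **opts)
  html_bodies(mail, **opts).max_by(&:size)
end
```

With the mail gem loaded, `largest_html(Mail.read('raw.txt'))` would be the entry point. The `parse` keyword exists mainly so the traversal can be exercised with stand-in objects, without real email fixtures.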