
How to extract text from the plain text part of multipart/alternative?


# main.py
import email
from email.iterators import _structure
import sys
msg = email.message_from_string(sys.stdin.read())
_structure(msg)

./main.py <<EOF
From:  Nathaniel Borenstein <[email protected]>
To: Ned Freed <[email protected]>
Subject: Formatted text mail
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=boundary42


--boundary42
Content-Type: text/plain; charset=us-ascii

...plain text version of message goes here....

--boundary42
Content-Type: text/richtext

.... richtext version of same message goes here ...
--boundary42
Content-Type: text/x-whatever

.... fanciest formatted version of same  message  goes  here
...
--boundary42--
EOF

The output

multipart/alternative
    text/plain
    text/richtext
    text/x-whatever

I can call the email module to get the structure of a multipart email message like the above. How can I extract the text/plain part of the email message? (In this particular example, it should be "...plain text version of message goes here....".)


Solution

  • For a multipart message, msg.get_payload() returns a list of its sub-parts. Iterate over that list until you find the text/plain part:

    # main.py
    import email
    import sys
    
    msg = email.message_from_string(sys.stdin.read())
    
    for part in msg.get_payload():
        if part.get_content_type() == 'text/plain':
            print(part.get_payload())
    

    Given your sample input, the above code produces as output:

    ...plain text version of message goes here....
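
The loop above assumes the text/plain part is a direct child of a multipart message and carries no transfer encoding. Here is a slightly more defensive sketch, assuming you want the first text/plain part anywhere in the tree, decoded to a str. The helper name first_plain_text and the sample message are hypothetical and not part of your input.

```python
import email


def first_plain_text(msg):
    """Return the decoded text of the first text/plain part, or None."""
    # walk() yields the message itself and then every sub-part, depth first,
    # so this also covers non-multipart and nested multipart messages.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes base64 / quoted-printable and returns bytes.
            raw = part.get_payload(decode=True)
            # Assumption: fall back to us-ascii when no charset is declared.
            charset = part.get_content_charset() or "us-ascii"
            return raw.decode(charset, errors="replace")
    return None


# Hypothetical sample: the plain part is base64-encoded.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=boundary42

--boundary42
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQ=

--boundary42
Content-Type: text/html; charset=utf-8

<p>hello world</p>
--boundary42--
"""

if __name__ == "__main__":
    print(first_plain_text(email.message_from_string(SAMPLE)))
```

Returning None instead of printing inside the loop lets the caller decide what to do when a message has no text/plain part at all.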
    

    You could instead use email.iterators.typed_subpart_iterator, like this:

    # main.py
    import email
    import email.iterators
    import sys
    
    msg = email.message_from_string(sys.stdin.read())
    
    for part in email.iterators.typed_subpart_iterator(msg, maintype='text', subtype="plain"):
        print(part.get_payload())
    

    This produces the same output as the earlier example.


    Regarding the comment that docs.python.org/3/library/email.parser.html says get_body() can work:

    The get_body method is only available on email.message.EmailMessage, but by default email.message_from_string returns a legacy email.message.Message object (the parser functions default to the email.policy.compat32 policy).

    To get an email.message.EmailMessage object, pass a policy argument:

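    import email
    import email.policy
    import sys
    
    msg = email.message_from_string(sys.stdin.read(), policy=email.policy.default)
    
    # Ask for text/plain explicitly; by default get_body() prefers HTML over plain text.
    print(msg.get_body(preferencelist=('plain',)).get_payload())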
    

    This will also produce the same output as the first example.
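
With the EmailMessage API you may also want to handle a missing plain part and let the library do the decoding. Below is a minimal sketch under those assumptions. The function name plain_body and the sample message are hypothetical; get_body() returns None when no candidate part matches, and get_content() returns the text already decoded from its transfer encoding and charset.

```python
import email
import email.policy

# Hypothetical sample: a quoted-printable plain part next to an HTML alternative.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=boundary42

--boundary42
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

caf=C3=A9

--boundary42
Content-Type: text/html; charset=utf-8

<p>caf&eacute;</p>
--boundary42--
"""


def plain_body(raw):
    """Return the decoded text/plain body of a raw message string, or None."""
    msg = email.message_from_string(raw, policy=email.policy.default)
    # Restrict the search to text/plain candidates only.
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        return None
    # get_content() decodes quoted-printable/base64 and the declared charset.
    return body.get_content()


if __name__ == "__main__":
    print(plain_body(SAMPLE))
```

Compared with get_payload(), get_content() saves you from decoding bytes yourself, at the cost of requiring a non-legacy policy such as email.policy.default.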