Tags: python-3.x, email, base64, email-attachments, decoding

How to extract data from 'application/pkcs7-mime' using the email-module in Python?


Problem

I am working on a project where we have to classify e-mails. For this project I need to extract ALL text from the e-mails and their attachments.

My problem is that some attachments are of the type "application/pkcs7-mime", and I am not sure how to handle those.

What I have tried

import email, base64

# Opening message (the with-block closes the file after parsing)
eml_file = '/path/to/file.eml'
with open(eml_file) as f:
    message = email.message_from_file(f)

# Printing content types
for part in message.walk():
    print(part.get_content_type())

>multipart/mixed
 text/plain
 message/rfc822
 application/pkcs7-mime

The part causing the problems is "application/pkcs7-mime". Next, I try to extract data from its payload.

# Ensuring we got the right payload
message.get_payload(1).get_payload(0).get_content_type()
>application/pkcs7-mime

# Getting payload
message.get_payload(1).get_payload(0).get_payload()
MIAGCSqGSIb3DQEHAqCAMIACAQExDzANBglghkgBZQMEAgEFADCABgkqhkiG9w0BBwGggxDhqwSD
EOGmQ29udGVudC1UeXBlOiBtdWx0aXBhcnQvbWl4ZWQ7IGJvdW5kYXJ5PV8wMDRfRkVGNjMyNEIy
NDFFNDA5MEFBMUQ3RUQ5QUZBQkZFMDdic2Vka187DQogbWljYWxnPXNoYTI1Ng0KDQotLV8wMDRf
RkVGNjMyNEIyNDFFNDA5MEFBMUQ3RUQ5QUZBQkZFMDdic2Vka18NCkNvbnRlbnQtVHlwZTogdGV4
dC9odG1sOyBjaGFyc2V0PXV0Zi04DQpDb250ZW50LUlEOiA8Mzg3NTg2Nzk0M0UzNkY0OUEzQjcy
.......

It looks like the payload is encoded in base64, so I try to decode it:

# Decoding message
message.get_payload(1).get_payload(0).get_payload(decode=True).decode("iso-8859-1")

# Top of output
 á«á¦Content-Type: multipart/mixed; boundary=_004_FEF6324B241E4090AA1D7ED9AFABFE07bsedk_;
 micalg=sha256

--_004_FEF6324B241E4090AA1D7ED9AFABFE07bsedk_
Content-Type: text/html; charset=utf-8
Content-ID: <[email protected]>
Content-Transfer-Encoding: base64

PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlIiB4
bWxuczp3PSJ1cm46c2NoZW1hcy1taWNyb3NvZnQtY29tOm9mZmljZTp3b3JkIiB4bWxuczptPSJo
dHRwOi8vc2NoZW1hcy5taWNyb3NvZnQuY29tL29mZmljZS8yMDA0LzEyL29tbWwiIHhtbG5zPSJo

# Bottom of output
0U 0U00, * (&http://crl.oces.trust2408.com/oces.crl0\ Z X¤V0T10 UDK10U
×ÿÑE|g¢¿ºÊ]6¾ãñJfYéÿBû´s²J7ª¡-*¨    Ø p@6¨Ë9ǦíæýÕUB9Íõ £àg&7î!Ö®?sÒ
8Wè>º|mIh +|2¨Ü%«O´|%o¬¦¥ð¶oòÝ¿t³ é¾ÏPÒëqûW®ö8|¼èÇçqQL ½þ¹Ù`¶¿*7ÒýsÎKb!ÿ¶?:{¸½Ú\õ{ÕþyÅI
.­éÃÂKèj36q°lD}|RÎàIzÙ/^[j;3¶Ðà+®r$¥!1ß0Û0I0A10 UDK10UªÏÒ¶,`§rÌbèñy5&FJÃ(êí°&¬,ÚêX$á.=ïïæN$®]Þ½yU0+*§ÔÚ´í6azL(!DÏÝ6ÂNê,Ä5FsíXEó§î_»SG]Úüåt Ô¼'âröÓg!ðSÐ,O¶x><s5ÖRv«N¸¡¿<ý>¼VBñ¤f[ÔÏàø7¿ÂûÊ
uidÊUS!ÂÕÜÚæòÜíþµüâeüLü'^[¦/d{«oäp¹ÁNî÷Ž½Oq$Øà-W
DsüèXÀÎ}á¾9À̹ÙÎhAÎ ¯P¾ñäK!VIrÏ̯Ín,º~¿IÐ{[D¢=ý%Úîckr¿·_³EÙ]¨g0âk:`vÄÖ</È»HBà4%ª]|;~wÐ=·&;êºÕAr«Á¡GÅòØ)wÂd{Ù    BvÞ·3ºàCÔ

Conclusion

Some of the message is decoded correctly, while the rest is a mess. However, I cannot figure out where things go wrong.
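
A possible explanation, offered as a hedged note: the first bytes of the payload match the DER encoding of the PKCS#7/CMS signedData object identifier (1.2.840.113549.1.7.2). If that is what this attachment is, the readable part would be the signed MIME content, and the unreadable tail is most likely the certificates and signature stored after it (the CRL URL in the output points the same way). A minimal sketch of that check, where `SIGNED_DATA_OID` and `looks_like_signed_data` are names made up for illustration:

```python
import base64

# DER encoding of the OID 1.2.840.113549.1.7.2 (PKCS#7 / CMS signedData),
# including its tag byte (0x06) and length byte (0x09).
SIGNED_DATA_OID = bytes.fromhex("06092a864886f70d010702")


def looks_like_signed_data(raw: bytes) -> bool:
    """Return True if the signedData OID appears near the start of the bytes."""
    return SIGNED_DATA_OID in raw[:32]


# First 20 base64 characters of the payload shown in the question.
head = base64.b64decode("MIAGCSqGSIb3DQEHAqCA")
print(head.hex())
print(looks_like_signed_data(head))
```

The same function could be applied to the full result of `get_payload(decode=True)` to decide whether a part needs special handling.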


Solution

  • For now I have worked around the problem by handling these files separately: I remove the junk in front of the first header and convert the rest back into a message. This seems to work for now.

        import re
        # `decoded` is the payload string from get_payload(decode=True).decode("iso-8859-1")
        if message.get_filename() == "smime.p7m":
            message = email.message_from_bytes(re.sub(r'.*Content-Type:', 'Content-Type:', decoded).encode("iso-8859-1"))
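
The workaround above can also be done directly on bytes, which avoids the round trip through ISO-8859-1 and makes it easy to collect the text parts afterwards. This is only a sketch, assuming the embedded content sits in one contiguous chunk that starts at the first `Content-Type:` header. `extract_embedded_message` and `collect_text` are hypothetical helper names. It is a heuristic, not a real PKCS#7 parser, and it does not verify the signature.

```python
import email
from email.message import Message


def extract_embedded_message(raw: bytes) -> Message:
    """Parse the MIME message embedded in a decoded smime.p7m payload.

    Heuristic: drop everything before the first "Content-Type:" header.
    Bytes after the closing MIME boundary end up in the epilogue and are ignored.
    """
    start = raw.find(b"Content-Type:")
    if start == -1:
        raise ValueError("no embedded MIME headers found")
    return email.message_from_bytes(raw[start:])


def collect_text(msg: Message) -> list:
    """Return the decoded body of every text/* part in the message."""
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "utf-8"
        texts.append(payload.decode(charset, errors="replace"))
    return texts
```

With the message from the question, the input would be `message.get_payload(1).get_payload(0).get_payload(decode=True)`.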