Retrieve email subject from file via bash

I have a shell script which downloads files from the server's mail folder to a NAS device, so the client has local copies as a backup.

The files are saved as 11469448248.H15587P19346.smtp.x14.eu_2,S files. I've changed the extension to standard .eml format so email clients can read them from disc.

for f in *.smtp.x14.eu_2,S; do
#sed "9q;d" $f
#tail -n+9 $f | head -n1
mv -- "$f" "${f%.smtp.x14.eu_2,S}.eml";
done

As you can see, I've tried to use the sed and tail commands to get the 9th line from the file; the problem is that the subject isn't always on the 9th line, and the file names don't say much about their content.

I'm trying to get the file names into an understandable format, so I figured the subject could be helpful.

On the nth line of the email file is a line that begins with Subject: PD: the subject

I'm trying to find this line, get rid of Subject: PD: and use the rest as the new file name.

Solution

The following is wrong but implements what you seem to be asking.

subj=$(sed -n '/^$/q;/^Subject: PD: */{s///p;q;}' "$f")

The problem with this is that it succeeds in the trivial case, but fails when you have a MIME RFC2047-encoded Subject: header, and (more trivially) when the Subject: header spans more than a single line.
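To make that failure mode concrete, here is a minimal sketch in Python 3.6+. The sample message, the addresses and the helper names naive_subject and parsed_subject are all made up for illustration; the point is only to contrast a line-based grab with what a real header parser returns for an encoded, folded Subject: header.

```python
from email import message_from_string
from email.policy import default

# Hypothetical message: the Subject: header is RFC 2047 encoded
# and folded across two physical lines.
RAW = (
    "From: sender@example.com\n"
    "Subject: =?utf-8?q?PD=3A_Caf=C3=A9_menu?=\n"
    " =?utf-8?q?_for_Friday?=\n"
    "\n"
    "Body text\n"
)


def naive_subject(raw):
    """Mimic a line-based tool: return only the first physical Subject: line."""
    for line in raw.splitlines():
        if not line:  # an empty line ends the headers
            break
        if line.startswith("Subject: "):
            return line[len("Subject: "):]
    return None


def parsed_subject(raw):
    """Let the email library unfold and decode the header."""
    return str(message_from_string(raw, policy=default)["subject"])


# naive_subject(RAW) still contains the raw "=?utf-8?q?...?=" encodingFor
# and misses the continuation line; parsed_subject(RAW) is readable text.
```

The line-based version also never sees the PD: prefix here, because the colon is hidden inside the encoded word as =3A.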

I would approach this with a slightly more modern programming language. It's not quite a one-liner, but it's easy enough with Python.

subj=$(./emailsubj.py "$f")

where emailsubj.py contains something more or less like

#!/usr/bin/env python
from email.errors import HeaderParseError
from email.header import decode_header
from email.parser import Parser
from sys import argv

for filename in argv[1:]:
    with open(filename, 'rb') as handle:  # handle file not found etc?
        message = Parser().parse(handle)
    try:
        # Decode each RFC 2047 fragment using its declared charset, if any
        subj = ''.join([frag.decode(enc) if enc else frag
            for frag, enc in decode_header(message['subject'])])
    except (HeaderParseError, UnicodeDecodeError):
        # Fall back to the raw header value
        subj = message['subject']   # maybe warn about error?
    print(subj)

(Remember to chmod +x emailsubj.py, obviously.)

This retrieves the entire Subject: header and seems like a good design for a modular tool. If you want to remove a prefix after extracting the header, the shell has simple facilities for parameter expansion which do exactly that. For example,

echo "${subj#PD:\ }"

displays the value of $subj with any prefix PD: removed from the front of the value.

The above was written for Python 2.7. In Python 3.6+, a much simpler version is sufficient. The following uses the revamped email library in Python 3.6+, which is significantly simpler and more versatile, though for the time being it requires you to name an explicit policy.

from email import message_from_binary_file
from email.policy import default

from sys import argv

for filename in argv[1:]:
    with open(filename, 'rb') as handle:  # handle file not found etc?
        message = message_from_binary_file(handle, policy=default)
    print(message['subject'])

With Python 3, plain print should produce correct output on any sane platform. Of course, if your terminal can't display Unicode (in which case you are probably on Windows), it could still fail for that reason.
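Since the end goal in the question is a readable file name, here is a rough sketch of how the renaming itself might be done in Python 3.6+ instead of the shell loop. The helper names safe_name and rename_by_subject, the "no-subject" fallback, the 100-character limit and the numbered suffix for duplicates are all assumptions of mine, not anything the mail server or the email library prescribes. A subject can contain slashes and other characters which are awkward in file names, so the sketch replaces anything that is not a word character, dot or hyphen.

```python
import re
from email import message_from_binary_file
from email.policy import default
from pathlib import Path


def safe_name(subject, prefix="PD: "):
    """Turn a decoded subject into a conservative file name stem."""
    subject = str(subject) if subject else ""
    if subject.startswith(prefix):
        subject = subject[len(prefix):]
    # Collapse runs of anything but word characters, dots and hyphens
    stem = re.sub(r"[^\w.-]+", "_", subject).strip("_")
    return stem[:100] or "no-subject"


def rename_by_subject(path):
    """Rename one message file to <subject>.eml and return the new path."""
    path = Path(path)
    with path.open("rb") as handle:
        message = message_from_binary_file(handle, policy=default)
    stem = safe_name(message["subject"])
    target = path.with_name(stem + ".eml")
    if target == path:  # already has the desired name
        return path
    counter = 1
    while target.exists():  # several mails may share one subject
        target = path.with_name("{}-{}.eml".format(stem, counter))
        counter += 1
    path.rename(target)
    return target
```

Calling rename_by_subject on each downloaded file would then replace the mv line in the original loop. Keeping the sanitizing in its own function makes it easy to tighten or relax the rules for whatever file system the NAS uses.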