GREP date from email header and make it the file's creation date

I am on Mac Terminal and want to "grep" a string (which is a UNIX timestamp) out of an email header, convert that into a format the OS can work with and make that the creation date of the file. I want to do that recursively for all mails inside a folder (with multiple possible subfolders).

The structure would probably look something like this:

#!/bin/bash

for i in `ls`
do
  # Find the date field (X-Delivery-Time) inside an email header and grep the UNIX timestamp
  # convert timestamp to a format the OS can work with
  # overwrite the existing creation date with the new one
done

The mail headers look like this:

X-Envelope-From: <some@mail.com>
X-Envelope-To: <my@mail.com>
X-Delivery-Time: 1535436541
...

Some background: Apple Mail uses the date a file was created as the date displayed within Apple Mail. That’s why after moving mails from one server to another all mails now display the same date which makes sorting impossible.

As I am new to Terminal/Bash any help is appreciated. Thanks

Solution

What follows assumes you are using the default macOS utilities (touch, date, ...). As they are quite old BSD versions, some adjustments will be needed if you use the more recent GNU versions (e.g. from MacPorts or Homebrew). It also assumes that you are using bash.

If you have sub-folders ls is not the right tool. And anyway, the output of ls is not for computers, it is for humans. So, the first thing to do is find all email files. Guess what? The utility that does this is named find:

$ find . -type f -name '*.emlx'
./foo/bar.emlx
./baz.emlx
...

This searches for regular files (-type f), starting from the current directory (.), whose name matches *.emlx (-name '*.emlx'). Adapt to your situation. If all files are email files you can skip the -name ... part.

Next we need to loop over all these files and process each of them. This is a bit more complex than for f in ... for several reasons (large number of files, file names with spaces, ...). A robust way to do this is to feed the output of a find command to a while loop:

while IFS= read -r -d '' f; do
  <process file "$f">
done < <(find . -type f -name '*.emlx' -print0)

The -print0 option of find separates the file names with a null character instead of the default newline character. The < <(find ...) part is bash process substitution: it feeds the output of find to the standard input of the while loop. The while IFS= read -r -d '' f; do part reads each file name produced by find and stores it in the shell variable f, preserving leading and trailing spaces if any (IFS=), keeping backslashes literal (-r) and using the null character as the separator (-d '').
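To see why the null-delimited loop matters, here is a small sketch that builds a scratch directory and counts what the loop sees. The file names are made up for illustration, and one of them contains a newline, which would break a newline-separated loop.

```bash
#!/bin/bash
# Scratch directory with deliberately awkward (made-up) file names.
dir="$(mktemp -d)"
touch "$dir/plain.emlx" "$dir/with space.emlx" "$dir/  padded  .emlx"
touch "$dir/new"$'\n'"line.emlx"   # a newline inside the name
touch "$dir/ignored.txt"           # does not match '*.emlx'

count=0
while IFS= read -r -d '' f; do
  count=$((count + 1))
  printf 'found: [%s]\n' "$f"
done < <(find "$dir" -type f -name '*.emlx' -print0)

echo "count=$count"
rm -rf "$dir"
```

Each of the four .emlx names should arrive in f intact, including the padded one and the one with the embedded newline.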

Now we must code the processing of each file. Let's first retrieve the delivery time, assuming it is always the second word of the last line starting with X-Delivery-Time::

awk '/^X-Delivery-Time:/ {t = $2} END {print t}' "$f"

does that. If you don't know awk already, it's time to learn a bit of it. It's one of the very useful Swiss Army knives of text processing (sed is another). But let's improve it a bit so that it returns the first delivery time encountered instead of the last, stops as soon as it has found it, and also checks that the timestamp is a real timestamp (digits only):

awk '/^X-Delivery-Time:[[:space:]]+[[:digit:]]+$/ {print $2; exit}' "$f"

The [[:space:]]+ part of the regular expression matches 1 or more spaces, tabs,... and the [[:digit:]]+ matches 1 or more digits. ^ and $ match the beginning and the end of the line, respectively. The result can be assigned to a shell variable:

t="$(awk '/^X-Delivery-Time:[[:space:]]+[[:digit:]]+$/ {print $2; exit}' "$f")"

Note that if there was no match the t variable will store the empty string. We will use this later to skip such files.
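A quick way to convince yourself of both behaviours (first match wins, no match gives the empty string) is to try the awk program on two throwaway files. This is only a sketch: the function name extract_delivery_time and the sample contents are made up for the test.

```bash
# Print the first X-Delivery-Time value of the file given as $1, or nothing.
# extract_delivery_time is a hypothetical name; the awk program is unchanged.
extract_delivery_time() {
  awk '/^X-Delivery-Time:[[:space:]]+[[:digit:]]+$/ {print $2; exit}' "$1"
}

# Sample with two delivery times: only the first should be returned.
sample="$(mktemp)"
printf 'X-Envelope-To: <my@mail.com>\nX-Delivery-Time: 1535436541\nX-Delivery-Time: 42\n' > "$sample"

# Sample whose value is not made of digits: nothing should be returned.
broken="$(mktemp)"
printf 'X-Delivery-Time: yesterday\n' > "$broken"

good="$(extract_delivery_time "$sample")"
bad="$(extract_delivery_time "$broken")"
echo "good=[$good] bad=[$bad]"
rm -f "$sample" "$broken"
```

With a POSIX awk this prints good=[1535436541] bad=[].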

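Once we have this delivery time, which looks like a UNIX timestamp (seconds since 1970-01-01 00:00:00 UTC) in your example, we must use it to change the last modification time of the email file. On macOS, setting a modification time that is earlier than the file's creation time normally moves the creation time back as well, which is why this also takes care of the creation date. The command that does this is touch: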

$ man touch
...
touch [-A [-][[hh]mm]SS] [-acfhm] [-r file] [-t [[CC]YY]MMDDhhmm[.SS]] file ...
...

Unfortunately touch wants a time in the CCYYMMDDhhmm.SS format. No worry, the date utility can be used to convert a UNIX timestamp to any format we like. The result is expressed in your local time zone, which is also how touch -t interprets it. For instance, with your example timestamp (1535436541):

$ date -r 1535436541 +%Y%m%d%H%M.%S
201808280809.01
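The -r SECONDS form is specific to the BSD date shipped with macOS. If you may end up running with GNU date instead (coreutils), a small wrapper can pick the right syntax. This is a sketch under that assumption, and epoch_to_touch_stamp is a hypothetical helper name.

```bash
# Convert a UNIX timestamp ($1) to the CCYYMMDDhhmm.SS form used by touch -t.
# epoch_to_touch_stamp is a hypothetical helper name.
epoch_to_touch_stamp() {
  if date --version >/dev/null 2>&1; then
    date -d "@$1" +%Y%m%d%H%M.%S    # GNU date: only GNU accepts --version
  else
    date -r "$1" +%Y%m%d%H%M.%S     # BSD date (stock macOS)
  fi
}

# Force UTC inside the subshell so the result does not depend on the machine.
stamp="$(TZ=UTC; export TZ; epoch_to_touch_stamp 1535436541)"
echo "$stamp"
```

In UTC this prints 201808280609.01. The 201808280809.01 shown above is the same instant in a UTC+2 time zone.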

We are almost done:

while IFS= read -r -d '' f; do
  # uncomment for debugging
  # echo "processing $f"
  t="$(awk '/^X-Delivery-Time:[[:space:]]+[[:digit:]]+$/ {print $2; exit}' "$f")"
  if [ -z "$t" ]; then
    echo "no delivery time found in $f"
    continue
  fi
  # uncomment for debugging
  # echo touch -t "$(date -r "$t" +%Y%m%d%H%M.%S)" "$f"
  touch -t "$(date -r "$t" +%Y%m%d%H%M.%S)" "$f"
done < <(find . -type f -name '*.emlx' -print0)

Note how we test if t is the empty string (if [ -z "$t" ]). If it is, we print a message and jump to the next file (continue). Just put all this in a file with a shebang line (#!/bin/bash), make it executable and run it.
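Since the loop rewrites file dates in place, it can be reassuring to preview what it would do first. Here is one possible way to package the loop as a function with a dry-run switch. The function name fix_mail_dates, the -n flag and the GNU date branch are additions for illustration, not part of the loop above.

```bash
#!/bin/bash
# Hypothetical wrapper: fix_mail_dates [-n] [DIR]
#   -n   dry run: print the touch commands instead of running them
#   DIR  folder to scan recursively (defaults to the current directory)
fix_mail_dates() {
  local dry_run=0 root f t stamp
  if [ "$1" = "-n" ]; then dry_run=1; shift; fi
  root="${1:-.}"

  while IFS= read -r -d '' f; do
    t="$(awk '/^X-Delivery-Time:[[:space:]]+[[:digit:]]+$/ {print $2; exit}' "$f")"
    if [ -z "$t" ]; then
      echo "no delivery time found in $f" >&2
      continue
    fi
    # BSD date (stock macOS) takes -r SECONDS; GNU date takes -d @SECONDS.
    if date --version >/dev/null 2>&1; then
      stamp="$(date -d "@$t" +%Y%m%d%H%M.%S)"
    else
      stamp="$(date -r "$t" +%Y%m%d%H%M.%S)"
    fi
    if [ "$dry_run" -eq 1 ]; then
      # %q quotes the file name so the printed command can be pasted back
      printf 'touch -t %s %q\n' "$stamp" "$f"
    else
      touch -t "$stamp" "$f"
    fi
  done < <(find "$root" -type f -name '*.emlx' -print0)
}
```

You would first run fix_mail_dates -n some/mail/folder, check the printed commands, and then run it again without -n. Working on a copy of the folder is a sensible precaution.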

If, instead of the X-Delivery-Time field, you must use a Date field with a more complex and variable format (e.g. Date: Mon, 11 Jun 2018 10:36:14 +0200), the best would be to install a decently recent version of touch with the coreutils package of MacPorts or Homebrew. Both normally install it under the name gtouch, so that it does not shadow the system touch. Then:

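while IFS= read -r -d '' f; do
  # sub() strips the "Date:" prefix in place; unlike gensub() it exists in every awk
  t="$(awk '/^Date:/ {sub(/^Date:[[:space:]]*/, ""); print; exit}' "$f")"
  if [ -z "$t" ]; then
    echo "no Date field found in $f"
    continue
  fi
  # GNU touch, named gtouch when installed by MacPorts or Homebrew
  gtouch -d "$t" "$f"
done < <(find . -type f -name '*.emlx' -print0)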

The awk command is slightly more complex. It strips the Date: prefix from the first matching line and prints the rest. The following sed command does the same in a more compact form but is not really more readable. It is written so that it works with both the BSD sed of macOS and GNU sed:

t="$(sed -n '/^Date:/{s/^Date:[[:space:]]*//p;q;}' "$f")"
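To try the Date variant on a single throwaway file before running it on a whole folder, something like the following sketch can be used. It assumes a GNU touch is available, either as gtouch (MacPorts or Homebrew) or as plain touch (for example on Linux). The names gnu_touch and extract_date_header are hypothetical, and the awk program uses the portable sub() function.

```bash
# Pick a GNU touch: MacPorts and Homebrew normally install it as gtouch.
if command -v gtouch >/dev/null 2>&1; then
  gnu_touch=gtouch
else
  gnu_touch=touch   # assumption: this touch is GNU touch (e.g. on Linux)
fi

# Print the value of the first Date: header of the file given as $1.
extract_date_header() {
  awk '/^Date:/ {sub(/^Date:[[:space:]]*/, ""); print; exit}' "$1"
}

# Throwaway mail with a made-up header, using the Date example from above.
mail="$(mktemp)"
printf 'Subject: test\nDate: Mon, 11 Jun 2018 10:36:14 +0200\n\nbody\n' > "$mail"

d="$(extract_date_header "$mail")"
echo "$d"
"$gnu_touch" -d "$d" "$mail"   # GNU touch parses the mail-style date itself
```

If the stock macOS touch is picked up by mistake, the last command is expected to fail, because BSD touch does not understand this date format.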