Tags: php, regex, preg-match, text-extraction, text-parsing

Parse email string with metadata and get the From and Cc values


I'm trying to get the From and Cc email addresses from a forwarded email when the body looks like this:

$body = '-------
Begin forwarded message:


From: Sarah Johnson <[email protected]>

Subject: email subject

Date: February 22, 2013 3:48:12 AM

To: Email Recipient <[email protected]>

Cc: Ralph Johnson <[email protected]>


Hi,


hello, thank you and goodbye!

 [email protected]';

Now, when I do the following:

$body = strtolower($body);
$pattern = '#from: \D*\S([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})\S#';
if (preg_match($pattern, $body, $arr_matches)) {
     echo htmlentities($arr_matches[0]);
     die();
}

I correctly get:

from: sarah johnson <[email protected]>

Now, why doesn't the cc work? I do something very similar, only changing from to cc:

$body = strtolower($body);
$pattern = '#cc: \D*\S([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})\S#';
if (preg_match($pattern, $body, $arr_matches)) {
     echo htmlentities($arr_matches[0]);
     die();
}

and I get:

cc: ralph johnson <[email protected]> hi, hello, thank you and goodbye! [email protected]

If I remove the email from the original body footer (removing [email protected]) then I correctly get:

cc: ralph johnson <[email protected]>

It looks like the footer email is affecting the regular expression. But how, and why doesn't it affect the from case? How can I fix this?


Solution

  • The problem is that \D* matches too much: it matches any non-digit character, including newlines, and because it is greedy it runs on as far as the rest of the pattern allows. I would be more restrictive here. Why do you use \D (not a digit) at all?

    With e.g. [^@]* it works:

    cc: [^@]*\S([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})\S
    


    This way, the first part cannot run past the first @, so you can be sure it does not match beyond the email address on the Cc line.

    The \D is also the reason it works in the first, "From" case: there are digits in the "Date" row, so \D* cannot match past that row. In the "Cc" case no digit follows before the footer, so the match runs on to the last email address in the body.
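To put the answer's idea into a runnable form, here is a minimal sketch rather than a definitive implementation. It assumes each header sits on its own line, and it goes one step further than the answer by also excluding line breaks from the first part ([^@\r\n]*). The helper name extractHeaderEmail() is made up for illustration, and the example.com addresses are placeholders because the real ones are obscured in the question.

```php
<?php
// Placeholder body: same layout as in the question, with made-up addresses.
$body = "-------\nBegin forwarded message:\n\n"
      . "From: Sarah Johnson <sarah@example.com>\n"
      . "Subject: email subject\n"
      . "Date: February 22, 2013 3:48:12 AM\n"
      . "To: Email Recipient <recipient@example.com>\n"
      . "Cc: Ralph Johnson <ralph@example.com>\n\n"
      . "Hi,\n\nhello, thank you and goodbye!\n\n footer@example.com";

/**
 * Return the address from the first "<Header>: Name <address>" line, or null.
 * Hypothetical helper, not part of the original question or answer.
 */
function extractHeaderEmail(string $body, string $header): ?string
{
    // m flag: ^ anchors at the start of any line.
    // i flag: case-insensitive, so strtolower() is not needed.
    // [^@\r\n]*? cannot cross a line break or pass the first "@".
    $pattern = '/^' . preg_quote($header, '/')
             . ':[^@\r\n]*?([\w.-]+@(?:[\w-]+\.)+[a-z]{2,})/mi';

    return preg_match($pattern, $body, $m) ? $m[1] : null;
}

echo extractHeaderEmail($body, 'From'), "\n"; // sarah@example.com
echo extractHeaderEmail($body, 'Cc'), "\n";   // ralph@example.com
```

Because the first part can neither leave the header line nor skip an @, the footer address no longer influences the Cc match, and the From match no longer depends on digits happening to appear in the Date line.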