When sending an email, many servers add additional line breaks to limit the length of each line.
How can the original line breaks be recovered when fetching the email in a PHP script?
Suppose I send the following content:
Lorem ipsum Dolore incididunt in culpa ea ea sed quis sint voluptate quis laborum ullamco Excepteur do adipisicing consequat ex in reprehenderit officia in ad deserunt magna nulla dolor laborum occaecat reprehenderit aliquip dolor ea anim ea in veniam adipisicing culpa tempor qui elit voluptate consectetur elit laboris minim consectetur laboris anim incididunt Ut sunt sunt mollit elit irure do cillum dolore consequat in ea culpa ut velit sunt nulla in dolore voluptate dolore laborum reprehenderit dolore ut.
Ut non in veniam enim minim elit ad ut id ad eu voluptate cillum dolor laboris irure tempor mollit dolore exercitation eiusmod ea non ea ullamco nostrud cillum nostrud laborum commodo esse reprehenderit ut deserunt officia do in anim dolore ullamco pariatur ex amet nulla Excepteur mollit officia fugiat eu sed quis nisi fugiat dolor ea commodo ut sunt in consequat consectetur ut nulla pariatur est dolor dolore non ut occaecat officia Duis Ut ex exercitation esse ullamco nulla incididunt commodo pariatur dolore nostrud fugiat id dolor minim non sint amet adipisicing occaecat enim non Ut ad irure sint aliquip nisi ut commodo minim proident elit nulla quis ut ad dolor Excepteur dolore Duis.
Note that there is just one single line break in this text!
Checking the source code of the email at the receiving end using Thunderbird, or fetching the email body via PHP, the content is formatted like this:
Lorem ipsum Dolore incididunt in culpa ea ea sed quis sint voluptate
quis laborum ullamco Excepteur do adipisicing consequat ex in
reprehenderit officia in ad deserunt magna nulla dolor laborum occaecat
reprehenderit aliquip dolor ea anim ea in veniam adipisicing culpa
tempor qui elit voluptate consectetur elit laboris minim consectetur
laboris anim incididunt Ut sunt sunt mollit elit irure do cillum dolore
consequat in ea culpa ut velit sunt nulla in dolore voluptate dolore
laborum reprehenderit dolore ut.
Ut non in veniam enim minim elit ad ut id ad eu voluptate cillum dolor
laboris irure tempor mollit dolore exercitation eiusmod ea non ea
ullamco nostrud cillum nostrud laborum commodo esse reprehenderit ut
deserunt officia do in anim dolore ullamco pariatur ex amet nulla
Excepteur mollit officia fugiat eu sed quis nisi fugiat dolor ea commodo
ut sunt in consequat consectetur ut nulla pariatur est dolor dolore non
ut occaecat officia Duis Ut ex exercitation esse ullamco nulla
incididunt commodo pariatur dolore nostrud fugiat id dolor minim non
sint amet adipisicing occaecat enim non Ut ad irure sint aliquip nisi ut
commodo minim proident elit nulla quis ut ad dolor Excepteur dolore Duis.
Note that each line is limited to a certain length, so 16 additional line breaks are present. These additional line breaks were automatically added somewhere in the chain of events leading to me receiving the email.
I want my email-fetching PHP script to remove the additional line breaks to restore the original two-line format of the content.
I know that the new line breaks are not added by the PHP script, and I know where they come from. What I do not know is how to make my PHP script remove them.
Here is the code used to fetch the email body:
$connection = imap_open(
    sprintf(
        '{%s:110/pop3}INBOX',
        Configure::read('Email.Inbox.host')
    ),
    Configure::read('Email.Inbox.email'),
    Configure::read('Email.Inbox.password')
);
$mailbox = imap_check($connection);
$messages = imap_fetch_overview($connection, '1:' . $mailbox->Nmsgs);
foreach ($messages as $message) {
    $content = imap_fetchbody($connection, $message->msgno, 1);
}
What have I tried?
I tried using imap_body instead of imap_fetchbody, as the former does not process the email body. But the additional line breaks are already present before that and are indistinguishable from the regular line breaks. Both consist of \r\n.
I assume there has to be a way to do this, as Thunderbird displays the received email with the correct formatting, without the additional 16 line breaks, although they are present in the source code of the displayed message. So there probably has to be a way to strip the additional 16 line breaks from the email.
Here is a screenshot from Thunderbird which shows the source code of the email on the top and the resulting plain-text display on the bottom.
Even though this question is old, it was one of the top hits when I ran into this exact same problem. As Marc pointed out in the comments, it does have to do with format=flowed. So I dove into RFC 2646 and found section 4.1, Generating Format=Flowed:
Because a soft line break is a SP CRLF sequence, the generating agent creates one by inserting a CRLF after the occurance of a space.
A generating agent SHOULD NOT insert white space into a word (a sequence of printable characters not containing spaces). If faced with a word which exceeds 79 characters (but less than 998 characters, the [SMTP] limit on line length), the agent SHOULD send the word as is and exceed the 79-character limit on line length.
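To make the quoted rule concrete, here is a minimal sketch of how a generating agent might wrap a single paragraph. The function name flowParagraph and the default width of 72 (an arbitrary value below the 79-character limit quoted above) are my own assumptions for illustration; they are not part of the RFC or of PHP, and space-stuffing is deliberately left out.

```php
<?php
// Hypothetical helper: wraps one paragraph the way a format=flowed
// generator might. A line is only ever broken right after a space, so every
// soft break ends up as SP CRLF. Space-stuffing is NOT handled here.
function flowParagraph(string $paragraph, int $maxLen = 72): string
{
    $words = explode(' ', $paragraph);
    $last = count($words) - 1;
    $lines = [];
    $current = '';

    foreach ($words as $i => $word) {
        // Every word except the last keeps the space that followed it.
        $piece = ($i === $last) ? $word : $word . ' ';

        // Start a new line if this piece would not fit. An over-long word on
        // an otherwise empty line is sent as is and exceeds the limit.
        if ($current !== '' && strlen($current) + strlen($piece) > $maxLen) {
            $lines[] = $current; // ends in a space, so this is a soft break
            $current = '';
        }
        $current .= $piece;
    }
    $lines[] = $current;

    return implode("\r\n", $lines);
}
```

For example, flowParagraph('aaa bbb ccc', 8) yields "aaa bbb \r\nccc": the break is placed after the space, and the space itself stays in the text.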
So to get the email as it was originally written, find every SP+CRLF sequence and remove the CRLF, keeping the space, since the space is part of the original text. Then you might also want to undo the space-stuffing, while accounting for quoted text (lines starting with any number of > chars followed by a space). According to the RFC, the order of tests is: quoted line first, then space-stuffing, then flowed line:
On reception, if the first character of a line is a space, it is logically deleted. This occurs after the test for a quoted line, and before the test for a flowed line.
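That order of tests can be sketched line by line as follows. This is only an illustration under my own assumptions: unflowText and quotePrefix are hypothetical helper names (not part of PHP's imap extension), the result uses plain \n line endings, and the signature separator "-- " is treated as a fixed line.

```php
<?php
// Hypothetical helper: rebuilds the quote prefix for an output line.
function quotePrefix(int $depth): string
{
    return $depth > 0 ? str_repeat('>', $depth) . ' ' : '';
}

// Hypothetical helper: decodes a format=flowed body line by line.
function unflowText(string $body): string
{
    $result = [];
    $buffer = '';      // paragraph currently being reassembled
    $bufferDepth = 0;  // quote depth of that paragraph
    $open = false;     // true while the previous line ended in a soft break

    foreach (preg_split('/\r?\n/', $body) as $line) {
        // 1. Quote test: count and strip the leading ">" characters.
        $depth = strspn($line, '>');
        $line = substr($line, $depth);

        // 2. Space-stuffing: logically delete a single leading space.
        if ($line !== '' && $line[0] === ' ') {
            $line = substr($line, 1);
        }

        // 3. Flowed test: a trailing space marks a soft break.
        $flowed = $line !== '-- ' && substr($line, -1) === ' ';

        // A change in quote depth ends the paragraph being assembled.
        if ($open && $depth !== $bufferDepth) {
            $result[] = quotePrefix($bufferDepth) . $buffer;
            $buffer = '';
        }

        $buffer .= $line;
        $bufferDepth = $depth;
        $open = $flowed;

        if (!$flowed) {
            $result[] = quotePrefix($bufferDepth) . $buffer;
            $buffer = '';
        }
    }

    // Flush a paragraph that ended in a soft break at the end of the body.
    if ($open) {
        $result[] = quotePrefix($bufferDepth) . $buffer;
    }

    return implode("\n", $result);
}
```

Working per line keeps the three tests in the order the RFC gives and avoids joining lines across a change in quote depth, at the cost of more code than a handful of regular expressions.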
A crude PoC from my own kitchen:
// $content is the body part fetched with imap_fetchbody(); $section is its part number (1 in the question)
// I'm using imap_fetchmime() because I want to be sure I'm getting the proper MIME type for the relevant section
$mimes = imap_fetchmime($connection, $message->msgno, $section);
// I don't want to store all headers in an array since I just want to know the Content-Type
// [ \t]* is probably not necessary but it's there in case of broken clients/servers
if (preg_match('/^[ \t]*Content-Type.*format=flowed\b/mi', $mimes)) {
    // First, let's undo space stuffing but don't touch stuffed lines with quotes
    // (only one leading space is removed per line, so real indentation survives)
    $content = preg_replace('/^ (?!>+ )/m', '', $content);
    // Then, remove the (CR)LF of flowed SP+(CR)LF sequences, as well as any quotation marks that follow it, to reform one long line of text
    $content = preg_replace('/( )\r?\n(>+ +)?/', '$1', $content);
    // Remove empty quoted lines at *the end of the string only*, keeping any such lines anywhere else as-is for readability
    $content = preg_replace('/(\r?\n>+\s*)+$/', '', $content);
}
// And finally trim the entire thing (regardless of formatting)
$content = trim($content);
// Or when outputting to browsers:
//$content = nl2br(trim($content));
For me this works just fine on: