PHP IMAP: how to get just the text part of the body, without the <html> tags etc.?


I'm trying to write a script that downloads email from an Exchange server and later inserts it into a database, but I'm having trouble getting the 'text part' of the emails in a clean way.

PHP code:

<?PHP
$user = "[email protected]";
$password = "password123";
$mbox = imap_open("{exchange01:993/imap/ssl/novalidate-cert}", $user, $password);

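// imap_open() returns false on failure, so only fetch when the connection succeeded
if ($mbox) {
    // Section "1" of message number 1
    $message = imap_fetchbody($mbox, 1, '1');

    print_r($message);

    imap_close($mbox);
}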
?>

and the entire HTML body gets printed. I guess that's to be expected, but I'd like to not have the

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=iso-8859-1"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
    {font-family:Verdana;
    panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
    {font-family:"Neo Sans Std";}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif;
    mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:#0563C1;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:#954F72;
    text-decoration:underline;}
span.E-postmall17

...mumbo jumbo, just the text in the email itself (I can live with having the signature, images and so on).

Is there no easier way than roughly cutting the long string from <body... to </body... and then cutting it further from there? Other people must have wanted to solve the same problem, but I couldn't find an answer after spending an entire day on it and googling.

I guess in the end I'll just insert the entire HTML response into the database cell and hope for the best, but I'd rather not.

Help me, Stack Overflow. You're my only hope.
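
A possible alternative to stripping HTML, sketched under assumptions: many mails are multipart/alternative and carry a text/plain part next to the HTML one, and imap_fetchstructure() describes that layout. The helper below (findPlainTextSection is a hypothetical name, not part of the IMAP extension) walks such a structure object and returns the section string to pass to imap_fetchbody(). It only looks at the type, subtype and parts properties and assumes the simple numbering 1, 1.1, 1.2 and so on; nested message/rfc822 parts are not handled.

```php
<?php
// Hypothetical helper: find the section number of the first text/plain part
// in a structure object shaped like the result of imap_fetchstructure().
function findPlainTextSection(object $structure, string $prefix = ''): ?string
{
    // Leaf part: a single-part message body is addressed as section "1".
    if (empty($structure->parts)) {
        $isPlain = ($structure->type ?? null) === 0   // 0 = text
            && strtoupper($structure->subtype ?? '') === 'PLAIN';
        return $isPlain ? ($prefix === '' ? '1' : $prefix) : null;
    }

    // Multipart: children are numbered 1, 2, ... below the current prefix.
    foreach ($structure->parts as $index => $part) {
        $section = ($prefix === '' ? '' : $prefix . '.') . ($index + 1);
        $found = findPlainTextSection($part, $section);
        if ($found !== null) {
            return $found;
        }
    }

    return null;
}
```

With a live connection this could be used roughly as findPlainTextSection(imap_fetchstructure($mbox, 1)), followed by imap_fetchbody($mbox, 1, $section) when the result is not null. The fetched part may still be base64 or quoted-printable encoded (see the part's encoding property), in which case imap_base64() or imap_qprint() would be needed. If no text/plain part exists, stripping the HTML remains the fallback.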

Solution edit:

Not the exact solution I would've liked, but it does work (with some slight fixing to do).

echo strip_tags($message, '<body>');

Outputs just the

<body...>
Yayh the text i want!
</body .....>

part. Thanks a lot, @ThisGuyHasTwoThumbs (in the comments).
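
A small sketch of why this one-liner copes with the Word-generated markup: strip_tags() also drops HTML comments, and in the mail above the CSS sits inside comments. The sample string is a shortened, made-up stand-in for the real mail. The second call shows a caveat: CSS that is not wrapped in a comment survives as plain text.

```php
<?php
// Shortened stand-in for an Outlook/Word HTML mail: the CSS is inside an HTML comment.
$html = '<html><head><style><!-- p.MsoNormal {margin:0cm;} --></style></head>'
      . '<body lang=SV><p class=MsoNormal>Hello</p></body></html>';

// Keep only the <body> element; other tags and the commented-out CSS are dropped.
$kept = strip_tags($html, '<body>');

// Caveat: <style> content without a surrounding comment is kept as text.
$leaky = strip_tags('<style>p {color:red}</style><p>Hi</p>');

echo $kept, "\n", $leaky, "\n";
```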

Edit:

In the end the code became roughly this:

<?PHP
$user = "[email protected]";
$password = "password";
$mbox = imap_open("{exchange01:993/imap/ssl/novalidate-cert}", $user, $password);

// imap_open() returns false on failure, so only continue with a valid connection
if ($mbox) {
    $message = imap_fetchbody($mbox, 1, '1');

    // Keep only the <body ...>...</body> element
    $message = strip_tags($message, '<body>');
    // Cut off the opening <body ...> tag, then everything from </body> onwards
    $message = explode(">", $message);
    $message = explode("<", $message[1]);
    // Drop &nbsp;, decode the remaining entities, trim surrounding whitespace
    $message = trim(html_entity_decode(str_replace("&nbsp;", "", $message[0])));

    echo $message;

    imap_close($mbox);
}
?>

This removes the opening <body ...> tag and the closing </body>, then trims the whitespace at the start and end of the string (which @Goose also more or less covered in his edited answer below). It also decodes HTML entities into the corresponding characters and removes the &nbsp; entities.
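
The explode() steps depend on the text itself containing no literal < or > characters. A minimal sketch of the same idea without that dependency follows; htmlMailToText is a hypothetical name, and the input is assumed to be UTF-8 already (an iso-8859-1 mail like the one above would have to be converted first). Unlike the code above, it turns &nbsp; into an ordinary space instead of deleting it.

```php
<?php
// Hypothetical helper: reduce an HTML mail body to plain text.
// Assumes $html is UTF-8 encoded.
function htmlMailToText(string $html): string
{
    // Use the contents of <body>...</body> when present, otherwise the whole input.
    if (preg_match('~<body\b[^>]*>(.*?)</body>~is', $html, $m)) {
        $html = $m[1];
    }

    // Drop <style>/<script> blocks, whose contents strip_tags() would keep as text.
    $html = preg_replace('~<(style|script)\b[^>]*>.*?</\1>~is', '', $html) ?? $html;

    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0; replace it with a regular space.
    $text = str_replace("\u{00A0}", ' ', $text);

    return trim($text);
}
```

Usage would be along the lines of echo htmlMailToText($message); after the imap_fetchbody() call.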


Solution

  • What you want is strip_tags()

    http://php.net/manual/en/function.strip-tags.php

    $html = '<div>hello</div>';
    $text = strip_tags($html);
    echo $text; // hello
    

    If you need to remove excess whitespace from the resulting string, use the following. Note that it also removes new lines. Credit to the question "Remove excess whitespace from within a string".

    $text = preg_replace('/\s+/', ' ', $text);
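
Since the pattern above also collapses new lines, here is a hedged variant for cases where paragraph breaks should survive. tidyWhitespace is a hypothetical name; the sketch only treats spaces, tabs, \r and \n as whitespace.

```php
<?php
// Hypothetical variant: tidy whitespace but keep line breaks.
function tidyWhitespace(string $text): string
{
    // Runs of spaces and tabs become a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Normalise \r\n, \r and \n to \n and drop spaces around the break.
    $text = preg_replace('/ ?(?:\r\n?|\n) ?/', "\n", $text);
    // Allow at most one blank line in a row.
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}
```

For example, tidyWhitespace(strip_tags($html)) would keep the paragraphs of a mail apart while still removing the padding left behind by the stripped tags.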