I would like to clean the HTML bodies of many emails, which are a bit dirty (taken from emails sent with Gmail): there are lots of nested <div> elements, unwanted font changes, etc.
I would like to clean this up and keep only <a>, <b>, <br>, <i>, <img>, and nothing else (and maybe also <p> or a few <div> if and only if it's really necessary).
With the regex /<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g, it works most of the time:
document.onclick = function() {
    document.body.innerHTML = document.body.innerHTML.replace(/<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g, '');
};
<div dir="ltr"><div class="gmail_quote"><div dir="ltr">Hello,<div><br></div><div><div><div style="font-size:12.8px"><span style="font-size:12.8px">Thank you for your message.</span><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><span style="font-size:12.8px">If the L<span class="m_-527331299899979m_70391001927gmail-il">orem</span>i</span><span class="m_-527331299899979m_703910001927gmail-m_2466414472930393055gmail-il" style="font-size:12.8px">psum</span><span style="font-size:12.8px"> bla bla </span><a href="http://example.com" style="font-size:12.8px" target="_blank">test</a><span style="font-size:12.8px"> window, then it will be like this.</span><br></div><div style="font-size:12.8px">Blah blah.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Lorem ipsum<span style="font-size:12.8px">lorem ipsum </span><span style="font-size:12.8px">blah blah and</span><span style="font-size:12.8px"> you can </span><span style="font-size:12.8px">also <i>blah blah</i> and finally <i>Blah</i>.</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">-----------</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">Examples:</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test1</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test2</a></span></div><div><br></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test3</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div></div><div><br></div><div><span 
style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test5</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">ex<wbr>ample</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">exam<wbr>ple</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><br></div></div></div><div class="gmail_extra" style="font-size:12.8px"><div class="m_-52733129979m_703911927gmail-m_24664144055gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><span style="font-size:small">Sincerly,</span><br></div></div></div></div></div></div></div></div><div><div><div class="m_-52722719979m_7039100982345401927gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br></div><div>Myself<br></div><div dir="ltr"><br><b>example</b><br>web: <a href="http://www.example.com" target="_blank">www.example.com</a><br></div><div>fb: <a href="http://www.facebook.com/example/" target="_blank">www.facebook.com/LoremIp<wbr>sum/</a><br></div><div>mail: <a href="mailto:[email protected]" target="_blank">[email protected]</a><br></div><div dir="ltr"><br><img src="http://example.com/example.png"><br></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div>
(Click anywhere in the email after having run the Code Snippet to see what happens after the regex is applied)
Indeed:
- <span> or </span> are successfully removed
- <div fontstyle="..."> and </div> are removed

But there is a remaining problem when removing <div> like this:
- Empty lines are removed (see the empty line between lines 1 and 3 of the mail output, between lines 3 and 5, etc.)
- The newline is removed after each "example: test1" (see when you run the Code Snippet)
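To make the problem concrete, here is a minimal sketch on a hypothetical three-line fragment (made up for illustration, not taken from the email above). A <div> is a block element, so the browser renders each one on its own line; once the tags are stripped, nothing marks the line boundaries any more, and the single remaining <br> is only a line break, not an empty line:

```javascript
// Same regex as above: strips every tag except a, br, b and img.
var stripTags = /<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g;

// Hypothetical fragment: two lines of text separated by an empty line, Gmail style.
var fragment = '<div>line 1</div><div><br></div><div>line 3</div>';

var stripped = fragment.replace(stripTags, '');
console.log(stripped); // "line 1<br>line 3"
```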
I tried replacing <div.*?><br></div> with <br><br>, but the result is still not correct.
Question: How can I clean this HTML code, discard the unwanted font changes, etc., while keeping the same empty lines and keeping the <a>, <b>, <br>, <i>, <img> tags?
Note: it has to finally run in a Google Apps Script, so I'm not sure it's possible to import third-party JS libraries...
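The following 5-step process works for the sample you provided:
1. Remove all tags except <a>, <br>, <b>, <i>, <img> and <div>.
2. Replace every <div><br></div> with <br><br>.
3. Replace every sequence of closing </div> tags, each possibly preceded by <br>, with a single <br>.
4. Remove all opening <div> tags.
5. Replace every sequence of two or more <br> tags with two <br> tags.

Code: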
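document.onclick = function() {
    document.body.innerHTML = document.body.innerHTML
        // Strip every tag except a, br, b, i, img and (for now) div
        .replace(/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, '')
        // A div containing only <br> is an empty line: keep it as two breaks
        .replace(/<div[^>]*><br><\/div>/g, '<br><br>')
        // A run of closing divs (each optionally preceded by <br>) ends one line
        .replace(/((<br>)?<\/div>)+/g, '<br>')
        // Opening divs carry no line information any more
        .replace(/<div[^>]*>/g, '')
        // Never keep more than one empty line in a row
        .replace(/(<br>){2,}/g, '<br><br>');
};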
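Since the final target is Google Apps Script, where there is no document object, the same chain can be wrapped in a plain function that takes and returns a string. This is only a sketch: the name cleanEmailHtml and the short sample string are hypothetical, and the function assumes the same Gmail-style markup as the sample above (lowercase tags, <br> written without a slash, no > inside attribute values).

```javascript
// Hypothetical helper (the name is an assumption, not part of the answer above).
// Applies the same five replacements to an HTML string instead of the DOM.
function cleanEmailHtml(html) {
  return html
    .replace(/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, '') // keep only allowed tags + div
    .replace(/<div[^>]*><br><\/div>/g, '<br><br>')         // empty-line divs
    .replace(/((<br>)?<\/div>)+/g, '<br>')                 // runs of closing divs
    .replace(/<div[^>]*>/g, '')                            // leftover opening divs
    .replace(/(<br>){2,}/g, '<br><br>');                   // at most one empty line
}

// Made-up miniature of the email structure, for illustration only.
var sample = '<div dir="ltr">Hello,<div><br></div>' +
  '<div><span style="font-size:12.8px">Thank you.</span><br></div>' +
  '<div>See <a href="http://example.com">test</a></div></div>';

console.log(cleanEmailHtml(sample));
// Hello,<br><br>Thank you.<br>See <a href="http://example.com">test</a><br>
```

Because the function only uses String.prototype.replace with regular expressions, it needs no third-party library. Regex-based cleaning remains fragile on markup that differs from this sample, so it is worth checking the output on a few real emails.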