How to remove unnecessary tags from HTML generated by a text editor

Below is HTML that the Summernote text editor generated automatically from a Word document.

var html = `
<p>
   <b>
   <br>
   </b>
</p>
<p>
   <b>អ្នកធានា</b>
</p>
<p>
   <b>ឈ្មោះ: ……………………………</b>
</p>
<p>
   <b>អត្តសញ្ញាណប័ណ្ណលេខៈ………………...............
   <span style="white-space:pre"></span>..........................................
   </b>
</p>
<p>
   <b>
   <span style="white-space:pre"></span>ហត្ថលេខានិង ស្នាមមេដៃស្តាំ
   <span style="white-space:pre"></span>
   </b>
</p>
<p>
   <b>
   <br>
   </b>
</p>
<p>`;

After it generated the HTML for me, I tried to clean it up by removing unnecessary, empty tags, and tags that do not contain any value.

So I tried the following JavaScript:

html.replace('<p><br></p>', ''); // remove unnecessary tags
html.replace('&nbsp;', ''); // remove &nbsp; spaces
console.log(html);

However, the script above changes nothing; the empty and unnecessary tags are still there.

I don't know why it does not work, but when I tried a very simple replace, ' not replaced'.replace(' ',''), it worked just fine.

What's wrong with the code above? How can I remove all the unnecessary tags? Thanks.

Solution

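Your replace line isn't working because the string '<p><br></p>' doesn't match the exact structure of your HTML: it doesn't account for the whitespace between the tags. (Note also that replace returns a new string rather than modifying html in place, so its result has to be assigned to something.) You can take care of the whitespace by using a RegExp in your replace call, like this: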

html.replace(/<p>\s*<br>\s*<\/p>/, '');
//           /                    start of the regex literal
//            <p>                 a literal "<p>"
//               \s               any whitespace character
//                 *              previous char, zero or more times
//                  <br>          a literal "<br>"
//                      \s        any whitespace character
//                        *       previous char, zero or more times
//                         <\/p>  a literal "</p>" (with escaped slash)
//                              / end of regex

That would match <p> <br> </p> with any amount of whitespace between the tags, but the <b>s inside your <p>s foul it up. You can make ever-more complicated regexes to handle more and more esoteric situations, but that way lies madness, and it isn't possible in the general case.
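As a rough sketch of the limits of the regex approach (the sample strings flat and nested are made up for illustration, and the g flag is an addition so that every match is replaced, not only the first):

```javascript
// Hypothetical sample inputs, not taken from the question.
const flat = '<p>\n  <br>\n</p><p>text</p><p><br></p>';
const nested = '<p>\n  <b>\n  <br>\n  </b>\n</p>';

// Same pattern as above, plus the g flag so every match is replaced.
const emptyParagraph = /<p>\s*<br>\s*<\/p>/g;

// Strings are immutable: replace() returns a new string.
const cleanedFlat = flat.replace(emptyParagraph, '');
const cleanedNested = nested.replace(emptyParagraph, '');

console.log(cleanedFlat);              // <p>text</p>
console.log(cleanedNested === nested); // true: the <b> wrapper prevents a match
```

The second result is the problem in miniature: one extra level of nesting and the pattern no longer applies.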

Instead, we can pull the generated HTML into a DocumentFragment. Then we can work with it as a DOM tree, not a string:

const template = document.createElement('template');
template.innerHTML = html;
const fragment = template.content;
removeUselessNodes(fragment); // we'll need to write this one

The <template> HTMLTemplateElement helps us here, because we can assign an HTML string to its innerHTML property and pull it back out as a DocumentFragment from the content property. If we change the structure of the DocumentFragment, those changes will be reflected in the innerHTML property.*

* I can't find documentation backing me up on that, but it Works For Me in Firefox and Chromium.

Now we need to actually remove the "unnecessary, empty tags, and tags that [do] not contain any value." We'll define useless nodes to help do that:

Comment nodes are useless.

Text nodes that are empty or contain only whitespace are useless.

Non-void element nodes whose child nodes contain only useless nodes or <br> elements are useless.

All other nodes are not useless.

We need a function to identify and remove the useless nodes. Since we want to search the entire tree for useless nodes, we'll call the function recursively on the node's child nodes:

function removeUselessNodes(node) {
  for (let i = node.childNodes.length - 1; i >= 0; --i) {
    removeUselessNodes(node.childNodes.item(i));
  }

We iterate over the child nodes in reverse because Node.childNodes is a live list, and we'll be removing elements from it. The loop isn't aware of the changes we're making, and would skip elements if we went forwards. Removing elements from the end of the list won't disrupt a backward-iterating loop. We perform the recursive call first because it makes checking on the last useless-node condition easier.
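To see why the direction matters, here is a small sketch that uses a plain array as a stand-in for the live childNodes list. That stand-in is an assumption for illustration (a real NodeList is not an array), and removeForward and removeBackward are hypothetical helpers. Removing items while counting up skips the neighbor of each removed item; counting down does not:

```javascript
// Hypothetical helpers; the copied array stands in for a live NodeList.
function removeForward(items, isUseless) {
  const list = [...items];
  for (let i = 0; i < list.length; ++i) {
    if (isUseless(list[i])) {
      list.splice(i, 1); // later items shift left, so the next one is skipped
    }
  }
  return list;
}

function removeBackward(items, isUseless) {
  const list = [...items];
  for (let i = list.length - 1; i >= 0; --i) {
    if (isUseless(list[i])) {
      list.splice(i, 1); // only already-visited indexes shift
    }
  }
  return list;
}

const isBlank = s => s.trim() === '';
const sample = ['', ' ', 'text', '', ''];

console.log(removeForward(sample, isBlank));  // [ ' ', 'text', '' ]
console.log(removeBackward(sample, isBlank)); // [ 'text' ]
```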

With all the tree-traversal out of the way, we can start in on the useless-node conditions. Let's take them one by one:

Comment nodes are useless.

This one's easy. Nodes have a property indicating their type, nodeType. We can check it and remove the node if it's a comment:

  if (node.nodeType === Node.COMMENT_NODE) {
    node.remove();
    return;
  }

We return immediately after removing a useless node; there's nothing left to do. Next:

Text nodes that are empty or contain only whitespace are useless.

"[E]mpty or contain[s] only whitespace" is another way to say "doesn't contain non-whitespace", which we can test for with RegExp.test.

  if (
    node.nodeType === Node.TEXT_NODE
    && !/\S/.test(node.textContent)
  ) {
    node.remove();
    return;
  }

(\s is a whitespace character, \S (note the capitalization) is a non-whitespace character.)
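A quick sketch of how that test behaves on a few strings (isWhitespaceOnly is a hypothetical helper name, not part of the function we're building). In JavaScript, \s also covers the no-break space U+00A0, which is what a parsed &nbsp; becomes, so text nodes holding only no-break spaces count as useless under this rule:

```javascript
// Hypothetical helper mirroring the text-node test above.
const isWhitespaceOnly = text => !/\S/.test(text);

console.log(isWhitespaceOnly(''));         // true
console.log(isWhitespaceOnly(' \n\t '));   // true
console.log(isWhitespaceOnly('\u00A0'));   // true: \s covers the no-break space
console.log(isWhitespaceOnly('  text  ')); // false
console.log(isWhitespaceOnly('&nbsp;'));   // false: an unparsed entity is ordinary text
```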

The last test requires a little unpacking:

Non-void element nodes whose child nodes contain only useless nodes or <br> elements are useless.

Void elements are elements that cannot have children: things like <img>s and <hr>s. They're not useless; they have meaning on their own. For our purposes, non-void elements need meaningful children to be meaningful. A <p> by itself just makes some room on the page. Its child text node is where the text comes from. A <br> isn't useless when adjacent to other nodes, but by itself, it isn't enough to make its parent meaningful.

Breaking this down into individual tests, we get

Must be an element node
Must be non-void
Child nodes must contain only useless nodes or <br> elements

We've tested for node type before:

  if (
    node.nodeType === Node.ELEMENT_NODE

There's no convenient way to check for void-ness in JavaScript, but the HTML5 spec includes a list of void elements we can check against with the Element.tagName property:

    && ![
      'AREA',
      'BASE',
      'BR',
      'COL',
      'EMBED',
      'HR',
      'IMG',
      'INPUT',
      'LINK',
      'META',
      'PARAM',
      'SOURCE',
      'TRACK',
      'WBR'
    ].includes(node.tagName)

Since we've already removed all the useless child nodes from this node, the node passes the third test if all its children are <br> elements. childNodes is a NodeList, which doesn't have an every method, but with 0-indexed elements and a length property, we can call Array's every method on it:

    && Array.prototype.every.call(node.childNodes, n => n.tagName === 'BR')
  ) {
    node.remove();
    return;
  }
}
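The every trick can be tried without a DOM. In this sketch, plain objects stand in for child nodes (an assumption for illustration; allBr is a hypothetical helper name). Note the last case: every returns true for an empty list, so an element left with no children at all, like an empty <span>, also passes the test and gets removed:

```javascript
// Array-like objects standing in for childNodes (text nodes have no tagName).
const onlyBreaks = { 0: { tagName: 'BR' }, 1: { tagName: 'BR' }, length: 2 };
const withText = { 0: { tagName: 'BR' }, 1: { nodeValue: 'hi' }, length: 2 };
const noChildren = { length: 0 };

// Borrow Array's every method for anything with indexes and a length.
const allBr = list => Array.prototype.every.call(list, n => n.tagName === 'BR');

console.log(allBr(onlyBreaks)); // true
console.log(allBr(withText));   // false
console.log(allBr(noChildren)); // true: every() on an empty list is vacuously true
```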

With that, all of fragment's useless nodes have been removed. You can either get the resulting HTML from template.innerHTML, or send it straight into another element with document.adoptNode:

const adoptedNode = document.adoptNode(fragment);
document.querySelector('#destination').appendChild(adoptedNode);

Putting it all together:

var html = `
<p>
   <b>
   <br>
   </b>
</p>
<p>
   <b>អ្នកធានា</b>
</p>
<p>
   <b>ឈ្មោះ: ……………………………</b>
</p>
<p>
   <b>អត្តសញ្ញាណប័ណ្ណលេខៈ………………...............
   <span style="white-space:pre"></span>..........................................
   </b>
</p>
<p>
   <b>
   <span style="white-space:pre"></span>ហត្ថលេខានិង ស្នាមមេដៃស្តាំ
   <span style="white-space:pre"></span>
   </b>
</p>
<p>
   <b>
   <br>
   </b>
</p>
<p>`;

function removeUselessNodes(node) {
  for (let i = node.childNodes.length - 1; i >= 0; --i) {
    removeUselessNodes(node.childNodes.item(i));
  }

  if (node.nodeType === Node.COMMENT_NODE) {
    node.remove();
    return;
  }
  
  if (
    node.nodeType === Node.TEXT_NODE
    && !/\S/.test(node.textContent)
  ) {
    node.remove();
    return;
  }

  if (
    node.nodeType === Node.ELEMENT_NODE
    && ![
      'AREA',
      'BASE',
      'BR',
      'COL',
      'EMBED',
      'HR',
      'IMG',
      'INPUT',
      'LINK',
      'META',
      'PARAM',
      'SOURCE',
      'TRACK',
      'WBR'
    ].includes(node.tagName)
    && Array.prototype.every.call(node.childNodes, n => n.tagName === 'BR')
  ) {
    node.remove();
    return;
  }
}

const template = document.createElement('template');
template.innerHTML = html;
const fragment = template.content;
removeUselessNodes(fragment);

document.querySelector('#rawHTML').value = template.innerHTML;
const adoptedNode = document.adoptNode(fragment);
document.querySelector('#destination').appendChild(adoptedNode);

#rawHTML {
  width: 95vw;
  height: 10em;
}

<textarea id="rawHTML"></textarea>
<div id="destination"></div>