How to remove HTML elements by class name


I'm using phpMyAdmin to edit a database that stores several HTML pages, and I would like to remove from all of these pages every <div> (and any other tag) that has a certain class or id.

Example:

Case 1

<div class="undesirable">
  <div class="container">
    <div class="row">
      <div class="col1"></div> 
    </div>
   </div>
</div>

Case 2

<div class="undesirable">
  <div class="container">
    <div class="row">
      <div class="col1"></div>
      <div class="col2"></div> 
    </div>
   </div>
</div>

I would like to remove every <div> that has class="undesirable". In some cases the class may instead appear as class="pre_undesirable", or something similar.

Initially I thought of using regex, but because the HTML varies from page to page the patterns keep breaking: there is no reliable way to know where the matching </div> ends. An HTML parser is probably the answer, but I can't work out how to use one. Any pointers on where to start?


Solution

  • Since you are dealing with HTML, you should probably use an HTML parser and locate the elements to remove with XPath. To demonstrate, I'll change your HTML a bit:

    $original= 
    '<html><body>
    <div class="undesirable">
      <div class="container">
        <div class="row">
          <div class="col1"></div> 
        </div>
       </div>
    </div>
    <div class="keepme">
      <div class="container">
        <div class="row">
          <div class="col1"></div>
          <div class="col2"></div> 
        </div>
       </div>
    </div>
    
    <div class="pre_undesirable">
      <div class="container">
        <div class="row">
          <div class="col1"></div>
          <div class="col2"></div> 
        </div>
       </div>
    </div>
    <div class="keepme">
      <div class="container">
        <div class="row">
          <div class="col1"></div>
          <div class="col2"></div> 
        </div>
       </div>
    </div>
    </body>
    </html>
    ';
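    // Parse the markup into a DOM tree
    $HTMLDoc = new DOMDocument();
    $HTMLDoc->loadHTML($original);
    $xpath = new DOMXPath($HTMLDoc);

    // Select every <div> whose class attribute contains "undesirable"
    $targets = $xpath->query('//div[contains(@class,"undesirable")]');
    foreach ($targets as $target) {
        // Detach the element together with everything nested inside it
        $target->parentNode->removeChild($target);
    }
    echo $HTMLDoc->saveHTML();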
    

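    The output should include only the two "keepme" <div>s. Because contains() does a substring match on the class attribute, class="pre_undesirable" is removed along with class="undesirable".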
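
The question also mentions ids and tags other than <div>, and pages stored in a database are often fragments rather than complete documents. Below is a minimal sketch of one way to cover those cases, assuming PHP 7 or later with the DOM extension enabled. The helper name removeByClassOrId and the wrapper id fragment-root are hypothetical and not part of the original answer; the XML declaration prefix is a commonly used workaround for UTF-8 input, not something the question requires.

```php
<?php
// Hypothetical helper: removes every element whose class or id attribute
// contains $needle from an HTML fragment and returns the cleaned fragment.
function removeByClassOrId(string $html, string $needle): string
{
    // Assumption: $needle is a plain word. A double quote would break the
    // XPath string literal built below, so reject it up front.
    if (strpos($needle, '"') !== false) {
        throw new InvalidArgumentException('needle must not contain a double quote');
    }

    $doc = new DOMDocument();

    // Real-world markup often triggers parser warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a known root element so it can be found again
    // after loadHTML() adds its own <html> and <body> wrappers.
    $doc->loadHTML(
        '<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);

    // descendant::* matches any tag below the wrapper, not only <div>.
    $query = sprintf(
        'descendant::*[contains(@class, "%1$s") or contains(@id, "%1$s")]',
        $needle
    );

    // The XPath result is a snapshot, so removing nodes while looping is fine.
    foreach ($xpath->query($query, $root) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the wrapper's children, leaving out the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the <p> (matched by id) and the <span> (matched by class) should be
// removed, while the "keepme" <div> and its remaining text should survive.
$page = '<p id="undesirable_note">x</p>'
      . '<div class="keepme"><span class="pre_undesirable">y</span>kept</div>';
echo removeByClassOrId($page, 'undesirable');
```

To apply this to the database, one option is to read each stored page, pass it through the helper, and write the result back from a PHP script rather than editing rows by hand in phpMyAdmin. Take a backup first, since the parser may also normalize markup it considers malformed.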