Search code examples
phphtmlmysqlparsingdrupal-6

How can I extract multiple HTML tags from a MySQL table


I have a table in a MySQL based CMS, one of whose fields contains the text of articles displayed in the CMS web pages.

Some of the articles contain images embedded in the text, in the form of HTML 'img' tags. There may be one or several images in the text contained in the field.

What I want to do is to create a query that will extract a list of all the images in all the articles. I have managed to create some code as follows:

SELECT nid, 
substr(body,locate('<img', body),(locate('>',body,locate('<img', body)) - locate('<img', body))) as image,
body FROM `node_revisions` where body like '%<img%'

and this seems to work ok, however of course it only extracts the first image and I would really like to extract all of them (in fact of course this would generally mean using a loop but that doesn't seem possible in MySQL).

Just for reference, the CMS in question is Drupal 6, hence the names of the fields and table. However, this is really a question about MySQL not Drupal which is why I'm asking here not on the Drupal Stackexchange site.


Solution

  • You will drive yourself insane trying to use locate(), substring(), or regular expressions to parse HTML or XML. See https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

    I suggest you use PHP's DOMDocument class:

    <?php
    
    $bodyHtml = "now is the time for all <img src='good.jpg'> men to come to the <img src='aid.jpg'> of their country";
    
    $dom = new DOMDocument();
    $dom->loadHTML($bodyHtml);
    $imgs = $dom->getElementsByTagName("img");
    foreach ($imgs as $img) {
            print "$img->nodeName\n";
            foreach ($img->attributes as $attr) {
                    print "  $attr->name=$attr->value\n";
            }
    }
    

    Outputs:

    img
      src=good.jpg
    img
      src=aid.jpg