
How to replace specific text with hyperlinks without modifying pre-existing <img> and <a> tags?


This is the error I am trying to correct

<img class="lazy_responsive" title="<a href='kathryn-kuhlman-language-en-topics-718-page-1' title='Kathryn Kuhlman'>Kathryn Kuhlman</a> - iUseFaith.com" src="ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="<a href='kathryn-kuhlman-language-en-topics-718-page-1' title='Kathryn Kuhlman'>Kathryn Kuhlman</a> - iUseFaith.com" width="1600" height="517">

If you look carefully at the code above, you will see that the text in the alt and title attributes was replaced with a link because the keyword appeared in that text. As a result, my image is displayed with a tooltip that shows the link markup instead of just the name.

Problem: I have an array of keywords, where each keyword maps to the URL it should link to, like this:

$keywords["Kathryn Kuhlman"] = "https://www.iusefaith.com/en-354";
$keywords["Max KANTCHEDE"] = "https://www.iusefaith.com/MaxKANTCHEDE";

I also have some text, containing images and links, in which those keywords may appear.

$text='Meet God\'s General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE
';

I want to replace each keyword in the text with a full link to that keyword (including a title attribute), without touching the contents of any href, alt, or title attribute that is already in the text. This is what I tried:

$lien_existants = array();

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

if(preg_match_all("/$regexp/siU", $text, $matches, PREG_SET_ORDER)) 
{
    foreach($matches as $match) 
    {
        $lien_actuels_existant = filter_var($match[3], FILTER_SANITIZE_STRING);
        $lien_existants [] = trim($lien_actuels_existant);
          
        // $match[2] = link address
        // $match[3] = link text
        
        echo $match[2], '', $match[3], '<br>';
    }
}   

foreach(@$keywords as $name => $value) 
{
    if(!in_array($name, $lien_existants)&&!preg_match("/'/i", $name)&&!preg_match('/"/i', $name))
    {
        $text =  trim(preg_replace('~(\b'. $name.'\b)~ui', "<a href='$value' title='$name'>$1</a>", $text));
    }
    else
    {
        $name = addslashes($name);
        $text =  trim(preg_replace('~(\b'. $name.'\b)~ui', "<a href='$value' title='$name'>$1</a>", $text));
    }
    ######################################### 
}

This replaces the keywords with links, but it also replaces them inside the alt and title attributes of images.

How can I prevent it from replacing the text inside alt, title, and href?

Note: I have tried all the other solutions I found on S.O., so if you think one of them works, kindly apply it to my code above and show me how it should be done. If I knew how to make it work, I would not be asking here.


Solution

  • I think @Jiwoks' answer was on the right path with using dom parsing calls to isolate the qualifying text nodes.

    While his answer works on the OP's sample data, I was unsatisfied to find that his solution failed when there was more than one string to be replaced in a single text node.

    I've crafted my own solution with the goal of accommodating case-insensitive matching, word-boundary, multiple replacements in a text node, and fully qualified nodes being inserted (not merely new strings that look like child nodes).

    Code: (Demo #1 with 2 replacements in a text node) (Demo #2: with OP's text)
    (After receiving fuller, more realistic text from the OP: Demo #3 without trimming saveHTML())

    $html = <<<HTML
    Meet God's General Kathryn Kuhlman. <br>
    <img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
    <br>
    Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
    <br>
    Max KANTCHEDE & Kathryn Kuhlman
    HTML;
    
    $keywords = [
        'Kathryn Kuhlman' => 'https://www.example.com/en-354',
        'Max KANTCHEDE' => 'https://www.example.com/MaxKANTCHEDE',
        'eneral' => 'https://www.example.com/this-is-not-used',
    ];
    
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    
    $xpath = new DOMXPath($dom);
    
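    // Lowercased keyword => URL, so a match can be looked up regardless of its case.
    $lookup = [];
    // Keywords escaped with preg_quote() so they are matched literally.
    $regexNeedles = [];
    foreach ($keywords as $name => $link) {
        $lookup[strtolower($name)] = $link;
        $regexNeedles[] = preg_quote($name, '~');
    }
    // One alternation of all keywords, bounded by \b and matched case-insensitively.
    $pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~i';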
    
    foreach($xpath->query('//*[not(self::img or self::a)]/text()') as $textNode) {
        $newNodes = [];
        $hasReplacement = false;
        foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
            $fragmentLower = strtolower($fragment);
            if (isset($lookup[$fragmentLower])) {
                $hasReplacement = true;
                $a = $dom->createElement('a');
                $a->setAttribute('href', $lookup[$fragmentLower]);
                $a->setAttribute('title', $fragment);
                $a->nodeValue = $fragment;
                $newNodes[] = $a;
            } else {
                $newNodes[] = $dom->createTextNode($fragment);
            }
        }
        if ($hasReplacement) {
            $newFragment = $dom->createDocumentFragment();
            foreach ($newNodes as $newNode) {
                $newFragment->appendChild($newNode);
            }
            $textNode->parentNode->replaceChild($newFragment, $textNode);
        }
    }
    echo substr(trim($dom->saveHTML()), 3, -4);
    

    Output:

    Meet God's General <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>. <br>
    <img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517">
    <br>
    Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
    <br>
    <a href="https://www.example.com/MaxKANTCHEDE" title="Max KANTCHEDE">Max KANTCHEDE</a> &amp; <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
    

    Some explanatory points:

    • I am using some DomDocument silencing and flags because the sample input is missing a parent tag to contain all of the text. (There is nothing wrong with @Jiwoks' technique, this is just a different one -- choose whatever you like.)
    • A lookup array with lowercased keys is declared to allow case-insensitive translations on qualifying text.
    • The regex pattern is constructed dynamically, so each keyword should be passed through preg_quote() to ensure that the pattern logic is upheld. \b is a word boundary metacharacter that prevents matching a substring inside a longer word. Notice that eneral is not replaced in General in the output. The case-insensitive flag i allows greater flexibility for this application and future applications.
    • My xpath query is identical to @Jiwoks'; I see no reason to change it. It is seeking text nodes that are not the children of <img> or <a> tags.
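
    To check the pattern-building points in isolation, here is a minimal sketch that builds the same kind of pattern and runs it against a single string. The sample sentence and the variable names $needles, $found and $links are made up for illustration; the keyword map just mirrors the one above.

```php
<?php
// Keyword map shaped like the one in the answer (URLs are placeholders).
$keywords = [
    'Kathryn Kuhlman' => 'https://www.example.com/en-354',
    'eneral'          => 'https://www.example.com/this-is-not-used',
];

$lookup  = [];
$needles = [];
foreach ($keywords as $name => $link) {
    $lookup[strtolower($name)] = $link;      // lowercased key for case-insensitive lookup
    $needles[] = preg_quote($name, '~');     // escape regex metacharacters and the delimiter
}
$pattern = '~\b(' . implode('|', $needles) . ')\b~i';

// Hypothetical sample text: the keyword appears in a different case,
// and "eneral" only appears inside the longer word "General".
$sample = "Meet God's General KATHRYN KUHLMAN.";
preg_match_all($pattern, $sample, $matches);

$found = $matches[1];
$links = array_map(fn($m) => $lookup[strtolower($m)], $found);

print_r($found);
print_r($links);
```

    Only the full keyword is captured here: the word boundary before eneral fails because it is preceded by the letter G, and the lowercased lookup key still resolves the upper-case match to its URL.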

    ...now it gets a little fiddly... Now that we are dealing with isolated text nodes, regex can be used to differentiate qualifying strings from non-qualifying strings.
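
    As a rough sketch of that idea, the snippet below splits one string on the keyword pattern while keeping the matched keywords, then labels each fragment. The sample sentence is invented for illustration, and array_change_key_case() is simply a shortcut I am assuming is acceptable for building the lowercased lookup.

```php
<?php
// Keyword map shaped like the one in the answer (URLs are placeholders).
$keywords = [
    'Kathryn Kuhlman' => 'https://www.example.com/en-354',
    'Max KANTCHEDE'   => 'https://www.example.com/MaxKANTCHEDE',
];

// Lowercased keys so fragments can be looked up regardless of case.
$lookup = array_change_key_case($keywords, CASE_LOWER);

$needles = array_map(fn($name) => preg_quote($name, '~'), array_keys($keywords));
$pattern = '~\b(' . implode('|', $needles) . ')\b~i';

// Hypothetical text node content with two keywords in mixed case.
$text = 'Ask kathryn kuhlman or Max Kantchede.';

// DELIM_CAPTURE keeps the matched keywords; NO_EMPTY drops empty fragments.
$parts = preg_split($pattern, $text, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $part) {
    $kind = isset($lookup[strtolower($part)]) ? 'keyword' : 'text';
    printf("%-8s%s\n", $kind, json_encode($part));
}
```

    The fragments alternate between plain text and keywords, which is what lets the main loop decide per fragment whether to build an <a> node or a text node.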

    • preg_split() is creating a flat, indexed array of non-empty substrings. Substrings that qualify for translation are isolated as their own elements, and any non-qualifying substrings between them become separate elements as well.

      • The final text node in my sample will generate 4 elements:

        0 => '
        ',                                 // non-qualifying newline
        1 => 'Max KANTCHEDE',              // translatable string
        2 => ' & ',                        // non-qualifying text
        3 => 'Kathryn Kuhlman'             // translatable string
        
    • For translatable strings, new <a> nodes are created and filled with the appropriate attributes and text, then pushed into a temporary array.

    • For non-translatable strings, text nodes are created, then pushed into a temporary array.

    • If any translations/replacements have been done, then the DOM is updated; otherwise, no mutation of the document is necessary.

    • In the end, the finalized html document is echoed, but because your sample input has some text that is not inside of tags, the temporary leading <p> and trailing </p> tags that DOMDocument applied for stability must be removed to restore the structure to its original form. If all text is enclosed in tags, you can just use saveHTML() without any hacking at the string.
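
    If trimming the string by fixed offsets feels too fragile, one possible alternative (not what the code above does) is to wrap the input in a container element of your own and serialize only that container's children. This is a sketch under that assumption; the <div> wrapper and the sample input are hypothetical, and the serialized markup is not guaranteed to be byte-identical to the input in general (for example, <br /> comes back as <br>).

```php
<?php
// Hypothetical input: a fragment whose leading text is not inside any tag.
$html = 'Plain text <b>bold</b> and a <a href="https://www.example.com/">link</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// The wrapping <div> is an assumption of this sketch; it gives the parser
// a single root element so no <p> has to be invented around loose text.
$dom->loadHTML(
    '<div>' . $html . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// ... the text-node replacement loop would run here ...

// Serialize each child of the wrapper instead of the whole document.
$output = '';
foreach ($dom->documentElement->childNodes as $child) {
    $output .= $dom->saveHTML($child);
}
echo $output;
```

    Because only the wrapper's children are written out, nothing needs to be cut off afterwards, whether or not the original text was enclosed in tags.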