
How to improve my algorithm? (Searching and replacing words in formatted text)


I have some HTML source and an array of keywords. I'm trying to find all words that begin with any keyword in the array and wrap each one in a link tag.

For example, the keyword array has two values: [ABC, DEF]. It should match ABCDEF, DEFAD, etc. and wrap each word with hyperlink markup.

Here is the code I've got so far:

$_keys = array('ABC', 'DEF');
$text = 'Some ABCDD <strong>HTML</strong> text. DEF';

function search_and_replace($key, $text)
{
    $words = preg_split('/\s+/', trim($text)); // separate the words in $text
    foreach ($words as $word)
    {
        // strpos() === 0 means $word starts with $key
        if (strpos($word, $key) === 0)
        {
            // str_replace() replaces every occurrence of $word in $text
            $text = str_replace($word, '<a href="">' . $word . '</a>', $text);
        }
    }
    return $text;
}


foreach ($_keys as $_key)
{
    $text = search_and_replace($_key, $text);
}

My questions:

  1. Would this algorithm work?
  2. How would I modify this to work with UTF-8?
  3. How can I recognize hyperlinks in the HTML and ignore them (I don't want to put a hyperlink inside a hyperlink)?
  4. Is this algorithm safe?

Solution

  • is the algorithm "true"? (I'm reading that as "accurate")

    No, it is not. According to the PHP manual, str_replace returns

    a string or an array with all occurrences of search in subject replaced with the given replace value.

    So the word you matched is not the only thing being replaced: every occurrence of that string in the text is replaced on every pass. If a matching word appears more than once in the input, each occurrence ends up wrapped in multiple nested tags (run a syntactically corrected version of your code on such input to see it).
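
    To make the repeated wrapping concrete, here is a minimal sketch of the same loop on a hypothetical input in which the matching word occurs twice. The input string is an assumption chosen for illustration, not your data.

```php
<?php
// Hypothetical input: the matching word "ABCDD" appears twice.
$text  = 'ABCDD and ABCDD';
$words = preg_split('/\s+/', trim($text));

foreach ($words as $word) {
    if (strpos($word, 'ABC') === 0) {
        // str_replace() rewrites EVERY occurrence of $word in $text,
        // including occurrences that an earlier pass already wrapped.
        $text = str_replace($word, '<a href="">' . $word . '</a>', $text);
    }
}

echo $text, PHP_EOL;
// <a href=""><a href="">ABCDD</a></a> and <a href=""><a href="">ABCDD</a></a>
```

    The loop visits "ABCDD" twice, and each visit wraps both occurrences, so every word ends up inside two nested links.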

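  • work with UTF-8 Alphabets?

    Not sure, but as written, I don't think so. See Preg_Replace and UTF8. The preg_* functions are multibyte safe as long as the pattern carries the u modifier, which makes PCRE treat both the pattern and the subject as UTF-8.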
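
    As a rough illustration of UTF-8-aware matching, the sketch below finds words that start with a keyword using the u modifier and the \p{L} (any Unicode letter) class. The keyword and sample text are made up for the example, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical keyword and text containing non-ASCII letters.
$key  = 'Über';
$text = 'Überblick und über';

// preg_quote() escapes regex metacharacters in the keyword.
// (?<!\p{L}) requires that no letter precedes the keyword,
// \p{L}* consumes the rest of the word,
// and the trailing u switches PCRE to UTF-8 mode.
$pattern = '/(?<!\p{L})' . preg_quote($key, '/') . '\p{L}*/u';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
// Only "Überblick" matches; the lowercase "über" does not,
// because the pattern is case-sensitive.
```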

  • I want to ignore all words in each a tag for the search operation

    That's awfully hard. You'll have to avoid <a ...>word</a>, which starts to make a big mess fast. Matching HTML reliably with regex is a fool's errand.

    Probably the best approach would be to parse the page as XML or HTML. Have you considered doing this in JavaScript? Why do it on the server side? The advantage of JS is twofold: one, it runs on the client side, so you're offloading and distributing the work; and two, since the DOM is already parsed, you can find all text nodes and replace them fairly easily. In fact, I was helping a friend work on a Chrome extension to do almost exactly what you're describing; you could easily modify it to do what you're looking for.
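
    If the work has to stay on the server, the same "parse first, then touch only text nodes" idea can be sketched in PHP with the bundled DOM extension. This is a sketch under assumptions, not a finished implementation: link_keywords() is a hypothetical helper, the '#' link target is a placeholder, the wrapper <div id="root"> exists only to give the fragment a single root, and PHP 7.4 or later is assumed.

```php
<?php
// Hypothetical helper: wraps every word that starts with a keyword in <a>,
// skipping text that already sits inside an <a> element.
function link_keywords(string $html, array $keys, string $href = '#'): string
{
    libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc = new DOMDocument();
    // The XML prolog is a common way to hint UTF-8 to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $alternatives = implode('|', array_map(fn($k) => preg_quote($k, '/'), $keys));
    // Capture group so preg_split() returns the matched words as well.
    $pattern = '/((?<!\p{L})(?:' . $alternatives . ')\p{L}*)/u';

    $xpath = new DOMXPath($doc);
    // Text nodes that have no <a> ancestor.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($nodes as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // regex error or nothing to link in this node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) { // odd indexes hold the captured words
                $a = $doc->createElement('a');
                $a->setAttribute('href', $href);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper element.
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $out  = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = link_keywords(
    'Some ABCDD <strong>HTML</strong> text. <a href="x">ABCXX</a> DEF',
    ['ABC', 'DEF']
);
echo $result, PHP_EOL;
```

    Because the keywords are only searched inside text nodes, tag names and attribute values are never touched, and the existing link around ABCXX is left alone.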

  • a better alternative method?

    Definitely. What you're showing here is one of the worse methods of doing this. I'd push for you to use preg_replace (another answer has a good start for the regex you'd want, matching word breaks rather than whitespace), but since you want to avoid changing some elements, I now think that doing this in JS client-side is far better.
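
    For plain text with no markup to protect, a preg_replace version might look like the sketch below. The pattern is my own guess at what a word-break regex could be, not the one from the other answer, and it is not HTML-aware: it would also match inside tags, attributes, and existing links.

```php
<?php
$keys = ['ABC', 'DEF'];
$text = 'Some ABCDD text. DEF and XABC';

// Escape each keyword, then join them into one alternation.
$alternatives = implode('|', array_map(fn($k) => preg_quote($k, '/'), $keys));

// \b anchors the match at a word boundary, \w* takes the rest of the word,
// and the u modifier keeps the match UTF-8 safe.
$pattern = '/\b(?:' . $alternatives . ')\w*/u';

// $0 is the whole matched word; each word is visited exactly once.
$linked = preg_replace($pattern, '<a href="">$0</a>', $text);

echo $linked, PHP_EOL;
// Some <a href="">ABCDD</a> text. <a href="">DEF</a> and XABC
```

    Since the regex engine scans the text once, no word is wrapped twice, and XABC is skipped because ABC does not sit at the start of that word.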