regex, notepad++

REGEX: Find the files that don't contain the same link in 2 different HTML tags


I have more than 1000 HTML files. I need to find out, using a regex, whether the link in the <link> tag is repeated somewhere else in the same file.

For example, the first line has a <link> tag with this link: https://mywebsite.com/en/truth.html.

Further down, in an <a> tag wrapping an <img> tag, there is another link: https://mywebsite.com/en/love.html

<link rel="canonical" href="https://mywebsite.com/en/truth.html" />

text text
    
text

<img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/en/love.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>

The regex should find the files that don't contain the same link in both tags. I wrote a regex, but it doesn't work well.

This finds the first link, from the <link> tag: (<link rel="canonical" href="(.*?)" \/>.*?)

This finds the second link, next to the <img> tag: (alt="de" /></a>&nbsp; <a href=").*?("><img src)

I then use ?! to exclude the second link, so the regex is:

FIND (with ". matches newline" checked):

(<link rel="canonical" href="(.*?)" \/>.*?)(?!(alt="de" /></a>&nbsp; <a href=")).*?("><img src)

But it doesn't work: it matches even when both links are the same. I need to find the files in which the link at the top is not repeated further down.
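The behaviour can be reproduced outside Notepad++ with a short Python check. This is only a sketch: `make_page` is a hypothetical helper that builds a reduced version of the sample above, and `re.DOTALL` stands in for the ". matches newline" option. The likely cause is that the negative lookahead is tested at a single position (right after the lazy `.*?`, which has consumed nothing yet), and it never compares anything with the captured link, so it passes and the following `.*?` simply runs on to `"><img src`.

```python
import re

# The attempted pattern from the question, unchanged.
ATTEMPT = re.compile(
    r'(<link rel="canonical" href="(.*?)" \/>.*?)'
    r'(?!(alt="de" /></a>&nbsp; <a href=")).*?("><img src)',
    re.DOTALL,  # plays the role of ". matches newline"
)


def make_page(second_link):
    """Hypothetical helper: a reduced copy of the sample page."""
    return (
        '<link rel="canonical" href="https://mywebsite.com/en/truth.html" />\n'
        'text text\n'
        '<img src="index_files/flag_lang_de.jpg" title="de" alt="de" /></a>&nbsp; '
        f'<a href="{second_link}"><img src="index_files/flag_lang_ru.jpg" /></a>\n'
    )


same = make_page("https://mywebsite.com/en/truth.html")
different = make_page("https://mywebsite.com/en/love.html")

attempt_matches_same = ATTEMPT.search(same) is not None
attempt_matches_different = ATTEMPT.search(different) is not None

# Both lines print True: the pattern cannot tell the two pages apart.
print(attempt_matches_same)
print(attempt_matches_different)
```

Comparing the two links needs a backreference to the captured group, such as `\1`, placed where the second link starts.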


Solution

  • The solutions:

    FIND: (?s)<link\h+rel="canonical"\h*\Khref="([^"]+)"((?!<link).)+?<a href="(?!\1).+?"

    or

    FIND: (?s)^<link rel.+?https://([^"]+).+?https://(*SKIP)(?!\1)

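    or, with ". matches newline" checked (this one matches the files in which the link is repeated, so the files to inspect are the ones it does not report)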

    FIND: <link rel="canonical"[^>]*"(https[^"]+)"[^>]*>.*?(\1)

    Thanks to the people who found these answers.
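For anyone who prefers to script the check rather than run it in Notepad++, here is a minimal Python sketch. It is not a port of the patterns above: the first two use `\h`, `\K` and `(*SKIP)`, which Python's `re` module does not support, so this version extracts the canonical link and then looks for it among the `<a href="...">` values. The function names, the `*.html` glob and the assumption that `rel` comes directly before `href` (as in the sample) are illustrative choices, not part of the original answers.

```python
import re
from pathlib import Path

# Assumes the attribute order shown in the question: rel first, then href.
CANONICAL_RE = re.compile(r'<link\s+rel="canonical"\s+href="([^"]+)"', re.IGNORECASE)
ANCHOR_RE = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)


def canonical_is_repeated(html):
    """Return True if the canonical link also appears in an <a href>,
    False if it does not, and None if the page has no canonical link."""
    match = CANONICAL_RE.search(html)
    if match is None:
        return None
    canonical = match.group(1)
    return canonical in ANCHOR_RE.findall(html)


def files_without_repeat(folder):
    """List the .html files in `folder` whose canonical link is not repeated."""
    result = []
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        if canonical_is_repeated(html) is False:
            result.append(path.name)
    return result


if __name__ == "__main__":
    # "." is a placeholder; point it at the folder holding the HTML files.
    for name in files_without_repeat("."):
        print(name)
```

Files with no canonical link at all are skipped rather than reported, which may or may not be what you want for your set of pages.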