I have more than 1000 HTML files. I need to find out with regex whether the link from the <link> tag is repeated in another location in the same file.
For example, the first line has a <link> tag with the link https://mywebsite.com/en/truth.html. Further down, next to an <img> tag, there is another link, https://mywebsite.com/en/love.html:
<link rel="canonical" href="https://mywebsite.com/en/truth.html" />
text text
text
<img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a> <a href="https://mywebsite.com/en/love.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
After running the regex, I should find the files that don't contain the same link in two different HTML tags. I wrote a regex, but it is not very good.
This finds the first link, from the <link> tag: (<link rel="canonical" href="(.*?)" \/>.*?)
This finds the second link, next to the <img> tag: (alt="de" /></a> <a href=").*?("><img src)
I then use ?! to exclude the second link, so the regex is:
FIND (with ". matches newline" checked):
(<link rel="canonical" href="(.*?)" \/>.*?)(?!(alt="de" /></a> <a href=")).*?("><img src)
But it is not working: it finds both links, even when they are the same. I need to find the files that don't contain the same link at the top and further down.
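A likely reason the attempt matches either way: the negative lookahead only checks the single position right after the <link> tag, and the .*? after it is then free to consume everything up to "><img src, whatever the second link is. Nothing in the pattern compares the second link with the first. The sketch below reproduces this in Python, with re.DOTALL standing in for ". matches newline". The make_page helper and the shortened sample markup are my own illustration, not part of the original files.

```python
import re

# The attempted pattern from the question, unchanged.
# re.DOTALL plays the role of the ". matches newline" option.
ATTEMPT = re.compile(
    r'(<link rel="canonical" href="(.*?)" \/>.*?)'
    r'(?!(alt="de" /></a> <a href=")).*?("><img src)',
    re.DOTALL,
)

def make_page(canonical, second):
    """Build a tiny sample page (hypothetical helper, for illustration only)."""
    return (
        f'<link rel="canonical" href="{canonical}" />\n'
        'text text\n'
        '<img src="index_files/flag_lang_de.jpg" alt="de" /></a> '
        f'<a href="{second}"><img src="index_files/flag_lang_ru.jpg" alt="ru" /></a>\n'
    )

TRUTH = "https://mywebsite.com/en/truth.html"
LOVE = "https://mywebsite.com/en/love.html"

same_links = make_page(TRUTH, TRUTH)
different_links = make_page(TRUTH, LOVE)

# The lookahead is tested right after the <link> tag, where the text is
# "\ntext text", so it always succeeds and the pattern matches both pages.
attempt_matches_same = ATTEMPT.search(same_links) is not None
attempt_matches_different = ATTEMPT.search(different_links) is not None
print(attempt_matches_same, attempt_matches_different)
```

To tell the two cases apart, the second link has to be compared with the captured first link, for example through a backreference such as \1.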
The solutions:
FIND: (?s)<link\h+rel="canonical"\h*\Khref="([^"]+)"((?!<link).)+?<a href="(?!\1).+?"
or
FIND: (?s)^<link rel.+?https://([^"]+).+?https://(*SKIP)(?!\1)
or (with ". matches newline" checked)
FIND: <link rel="canonical"[^>]*"(https[^"]+)"[^>]*>.*?(\1)
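If the check has to run over all 1000+ files outside the editor, the same idea can be sketched in Python. Python's re module does not support \h, \K or (*SKIP), so this is not a translation of the patterns above. It extracts the canonical link and compares it with every later <a href> value instead. The function names, the folder argument and the assumption that rel comes directly before href in the <link> tag are mine, and "not repeated" is taken here to mean that no later <a href> is equal to the canonical link.

```python
import re
from pathlib import Path

# Assumes rel="canonical" comes right before href, as in the sample.
CANONICAL = re.compile(r'<link\s+rel="canonical"\s+href="([^"]+)"')
ANCHOR_HREF = re.compile(r'<a\s+href="([^"]+)"')

def canonical_not_repeated(html):
    """True if the page has a canonical link that no later <a href> repeats."""
    found = CANONICAL.search(html)
    if found is None:
        return False  # no canonical link at all, nothing to compare
    canonical = found.group(1)
    later_links = ANCHOR_HREF.findall(html[found.end():])
    return canonical not in later_links

def files_to_fix(folder):
    """List the .html files in folder whose canonical link is not repeated."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.html")
        if canonical_not_repeated(path.read_text(encoding="utf-8", errors="ignore"))
    )

# Quick check on two in-memory samples modelled on the question.
page_different = (
    '<link rel="canonical" href="https://mywebsite.com/en/truth.html" />\n'
    'text\n<a href="https://mywebsite.com/en/love.html"><img src="x.jpg" /></a>'
)
page_same = (
    '<link rel="canonical" href="https://mywebsite.com/en/truth.html" />\n'
    'text\n<a href="https://mywebsite.com/en/truth.html"><img src="x.jpg" /></a>'
)
print(canonical_not_repeated(page_different), canonical_not_repeated(page_same))
```

Calling files_to_fix on the folder that holds the pages would then give the names of the files to review. Doing the comparison in code rather than inside one pattern keeps each regex short, at the cost of leaving the editor's Find in Files dialog.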
Thanks to the people who found these answers.