I am trying to create a MySQL query with a regex reference in order to correct a data entry mistake in my blog. Specifically I need to locate some internal links in my articles that the href value ends with ".html" (including the dot) and remove that part which is the mistake. These links should be without .html extension. Of course this is not so simple as that.
THE PROBLEM: I have many external links that ends up with this extension too (.html), which I dont want to change at all. I only want to change some of my internal links that ends with .html. Also these links are relative. So the query must be more complex in order to find only these links content (appr. 6000 articles).
The good part is that these links I need to search have already a pattern. This is an example index.php?option=com_k2&view=item&id=6238:alias-title-of-an-article.html
(Yes, it is a Joomla installation with K2)
So my first thought was to create a query that comes with 2 conditions. A) Search for all links that href value starts with index.php?option=com_k2&view=item&id=
and remove the ending part of it .html
(If any exists). So far I have managed to do that:
REGEX Ref: href="index\.php\?option=com_k2.*?(?=")
MySQL Query: UPDATE 'qc95t_k2_items' SET 'introtext' = REPLACE('introtext', '.html', '') WHERE 'catid'=84 AND 'introtext' LIKE '%href="index\.php\?option=com_k2.*?(?=")%';
Except from the fact that this query is 100% wrong because it will select all references with .html
it also returnes 0 results as it cannot find anything. Also In this example below my regex reference is working (I think) exept the "ends with .html" condition.
I appreciate your help with that.
Finally, I came up with a workthrough on my own. Since it is mostly impossible to create such a complex query with regex replacements (REPLACE + REGEXP) directly through a MySQL Query, the solution I found is this:
"index.php?option=com_k2&view=item&id=6238:alias-title-of-an-article
) and the second group which will contain only the part that you want to be removed (in this case it would be .html
)SEARCH: ([^href="]*"index\.php\?option=com_k2&view=item&id=\d{2,4}:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*[\w\-]*)\.html
REPLACE WITH: $1
The above Regex command (Search field) groups the url except the ".html" part and on Replace ($1
) it tells to Notepad to replace all but keep the 1st group which is the proper url without .html
. This works tottaly (Tested) and in my case I have cleaned up 658 urls. Below you can find the Regex in action: