Search code examples
htmlmysqlregexsql-like

MySQL - Search and replace text (".html") from dynamic url's that exists in href attributes, inside content columns


I am trying to create a MySQL query with a regex reference in order to correct a data entry mistake in my blog. Specifically I need to locate some internal links in my articles that the href value ends with ".html" (including the dot) and remove that part which is the mistake. These links should be without .html extension. Of course this is not so simple as that.

THE PROBLEM: I have many external links that ends up with this extension too (.html), which I dont want to change at all. I only want to change some of my internal links that ends with .html. Also these links are relative. So the query must be more complex in order to find only these links content (appr. 6000 articles).

The good part is that these links I need to search have already a pattern. This is an example index.php?option=com_k2&view=item&id=6238:alias-title-of-an-article.html (Yes, it is a Joomla installation with K2)

So my first thought was to create a query that comes with 2 conditions. A) Search for all links that href value starts with index.php?option=com_k2&view=item&id= and remove the ending part of it .html (If any exists). So far I have managed to do that:

REGEX Ref: href="index\.php\?option=com_k2.*?(?=")

MySQL Query: UPDATE 'qc95t_k2_items' SET 'introtext' = REPLACE('introtext', '.html', '') WHERE 'catid'=84 AND 'introtext' LIKE '%href="index\.php\?option=com_k2.*?(?=")%';

Except from the fact that this query is 100% wrong because it will select all references with .html it also returnes 0 results as it cannot find anything. Also In this example below my regex reference is working (I think) exept the "ends with .html" condition.

REGEX Ref Example

I appreciate your help with that.


Solution

  • Finally, I came up with a workthrough on my own. Since it is mostly impossible to create such a complex query with regex replacements (REPLACE + REGEXP) directly through a MySQL Query, the solution I found is this:

    1. Simply download (Export) the particular table (table.sql)
    2. Open it with Notepad++ editor
    3. Create a regex that matches 2 groups of strings. The first group will contain the url part up to ".html" (in this case it would be "index.php?option=com_k2&view=item&id=6238:alias-title-of-an-article) and the second group which will contain only the part that you want to be removed (in this case it would be .html)
    4. Search and replace using Notepad++ adding the below regex command

    SEARCH: ([^href="]*"index\.php\?option=com_k2&view=item&id=\d{2,4}:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*[\w\-]*)\.html

    REPLACE WITH: $1

    1. Save and Upload your table back to your database

    The above Regex command (Search field) groups the url except the ".html" part and on Replace ($1) it tells to Notepad to replace all but keep the 1st group which is the proper url without .html. This works tottaly (Tested) and in my case I have cleaned up 658 urls. Below you can find the Regex in action:

    REGEX Command