Search code examples
phpmysqlregextagsmatch-against

Creating a tag-based related links feature using PHP and MySQL


I am developing a PHP-based system where users can create, post, and view pieces of content using a MySQL database, each piece of content being stored in a table row. When a user posts content, a PHP script extracts common words or tags (removing any stop words like ‘and’ and ‘or’), orders them by occurrence, and stores them all as an array within the database.

As well as viewing each piece of content and the tags generated, I would like a feature that displays a list of similar pieces of content posted which have one or more tags that the content being displayed has (similar to YouTube's related videos or related stories on news websites). Furthermore, I would like the list to be ordered based on how many of those tags each piece of content has.

I have done some research and I have developed two different scripts that can select rows from within the database based on whether any tags are present or not. However, both scripts have problems;

The first I tried was a LIKE query;

$tags=$row['tags'];
$tags2=explode(",",$tags);
foreach ($tags2 as $key => $keyword) {
    $sql = ("SELECT DISTINCT * FROM table WHERE tags LIKE '%$keyword%' ");
    if ($key != (sizeof($tags2)-1)){ 
        $sql .= " OR "; 
    }
    $sql .= " LIMIT 20";
}

The problem with this query is that it does not order the results. I then tried a MATCH AGAINST query;

$tags=$row['tags'];
$tags2=explode(",",$tags);
$searchstring="";
foreach ($tags2 as $word){
    $searchstring = $searchstring .' +'.$word;
}
$sql = ("SELECT * FROM table WHERE MATCH (tags) AGAINST ('$searchstring' IN BOOLEAN MODE)");

While the results are ordered by relevance, it only retrieves a row in which all tags are present, and if that row lacks even a single tag, it is not retrieved by the query.

What I want is to combine the best of both features, selects rows which contain one or more tags, and then order them based on how many tags are present. For example; if row1 contains 10 tags, row2 has 20 tags with 9 being found in row1, and row3 having 50 tags with 8 being found in row1, then both row2 and row3 will be retrieved, with row2 being more relevant than row3.

Any help would be appreciated


Solution

  • The inclusion of the + character in your $searchstring is what forces all tags to be present. If you put in just the words and omit the +, the engine will rank its results without requiring each word to be present.

    Take a look at the docs for fulltext searching in MySQL.

    You have many options with each word. + will force the word to be found somewhere in the result, - will force the word to not be found anywhere in the result, ~ will allow a word to be found but lower the result's ranking if it is found. Read the docs, they're very useful.