mysql, sphinx, distinct, concat-ws

Removing duplicate words from a MySQL CONCAT_WS result


I have a query that selects the data I need for a Sphinx index. Among other things, it uses CONCAT_WS to combine several name aliases (titles in different languages and so on) into one field. This becomes a problem when the names overlap. For example, one entry has the name "Clannad" and the alternative title "CLANNAD -クラナド-". Another has the names "Clannad After Story", "クラナド アフターストーリー" and "Clannad: After Story". Bear with me: I know this particular case could be resolved easily, but I'd like a solution that applies across the board. If you search for "Clannad", you get the After Story entry first because of the double match on 'Clannad'.

What I'd like to do is remove all duplicate (non-unique) words from the CONCAT_WS result, if that is even possible.

The query looks something like:

SELECT CONCAT_WS(' ',a.Name,a.Name2,a.Name3,a.Name4) AS name

(I hope I structured this question correctly, this being my first here) Thank you,


Solution

  • As Marc suggested in a comment, this is quite painful to manage in SQL (as far as I can see). I'd suggest caching the processed value in another column and indexing that instead.

    SELECT a.name_words AS name, ...
    

    Combining each of your name values and then getting the distinct words is a separate matter - but that really depends on what language you have at hand. Regular expressions should be of some help though - here's a quick attempt in Ruby:

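    # scan keeps only runs of word characters, so whitespace and punctuation
    # never show up as "words"; compact skips any nil names.
    [name, name2, name3, name4].compact.join(' ')
      .scan(/\p{Word}+/)
      .map(&:downcase)
      .uniq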
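
To make the idea concrete, here is a minimal plain-Ruby sketch of how the cached value could be built before it is written to the extra column. The helper name `distinct_words` is made up for illustration, and the `name_words` column is the one assumed in the SELECT above; how you load and save the rows depends on your setup and is left out.

```ruby
# Hypothetical helper: collapse several name aliases into one string of
# unique, lowercased words, keeping the order in which they first appear.
def distinct_words(*names)
  names.compact            # skip NULL/nil aliases
       .join(' ')
       .scan(/\p{Word}+/)  # runs of letters, marks, digits and underscores
       .map(&:downcase)
       .uniq
       .join(' ')
end

# The two entries from the question:
puts distinct_words('Clannad', 'CLANNAD -クラナド-', nil, nil)
# => clannad クラナド
puts distinct_words('Clannad After Story', 'クラナド アフターストーリー', 'Clannad: After Story')
# => clannad after story クラナド アフターストーリー
```

The result of `distinct_words` is what would be stored in `name_words` and indexed, so each word appears in the indexed field at most once. `\p{Word}` is used instead of `\w` because `\w` only matches ASCII word characters in Ruby and would drop the Japanese titles.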