Search code examples
mysqlgroup-bytagsnormalizecsv

mySQL >> Finding the most frequently used words in a comma-delimited field


I have a keywords field for each of my records called "RES_Tags". The table is "Resources".

The "RES_Tags" field contain a comma-delimited list of keywords for that record.

EX:

labor, work, unions, organized labor, strike, picket, boycott

What SQL query can I use to find out the 30 most frequently used tags?

I saw this related thread Count popular tags with comma delimited field on MySQL, but I'm hoping that someone has found a way since this question was originally asked.

Alternately, and the reason this questions isn't a duplicate, if it is impossible to use a SQL query to do what I'm asking and the only way is to normalize, what would be the best way to convert the existing comma-delimited lists into a Tags table and Tags-to-Resources table?


Solution

  • You can actually extract individual "terms" from a comma-separated list of terms in MySQL. It's incredibly nasty, and it requires knowing the maximum number of terms that will appear in any row. The SUBSTRING_INDEX() function is the key to it.

    Let's say you never have more than five terms in a field. Then this query gets all your terms.

    SELECT term FROM(
    SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(RES_Tags,','), ',',1), ',', -1)) term FROM Resources
    UNION ALL
    SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(RES_Tags,','), ',',2), ',', -1)) term FROM Resources
    UNION ALL
    SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(RES_Tags,','), ',',3), ',', -1)) term FROM Resources
    UNION ALL
    SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(RES_Tags,','), ',',4), ',', -1)) term FROM Resources
    UNION ALL
    SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(RES_Tags,','), ',',5), ',', -1)) term FROM Resources
    UNION ALL
    SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(RES_Tags,','), ',',6), ',', -1)) term FROM Resources
    ) terms
    WHERE LENGTH(term) > 0
    

    You can just put more terms in the union if you have a maximum term count more than five.

    Edit Should you normalize? Yes you should normalize. Can you use this kind of query to create a normalized version of your table. Yes. Here's some hints about how.

    Figure out how many tags are in the longest record you have now. Add two. Write this sort of query to support that number. Use it as part of a CREATE TABLE tags AS SELECT... query. Don't look back.