Search code examples
mysqlsqlgroup-byduplicates

How to find duplicate rows with similar part of string


I have thousands of rows in a table. Some rows have similar keywords but can be categorized in the same group. For example:

Table : Birds_Name
    +-------+---------------------+
    |ID     |Name                 |
    +-------+---------------------+
    |1      |Blue Peckwood        |
    +-------+---------------------+
    |2      |North Peckwood       |
    +-------+---------------------+
    |3      |Northern Peckwood    |
    +-------+---------------------+
    |4      |Northern Peckwood    |
    +-------+---------------------+
    |5      |Red Heron            |
    +-------+---------------------+
    |6      |Red Heron            |
    +-------+---------------------+

As for the table above there should be 2 groups of birds. They are Peckwood and Heron.

But after I run this mySQL I get :

SELECT *
FROM birds_name
WHERE name IN (
    SELECT name
    FROM birds_name
    GROUP BY name
    HAVING COUNT(*) > 1
)

After I run the query. This is what I've got:

    +-------+---------------------+
    |3      |Northern Peckwood    |
    +-------+---------------------+
    |4      |Northern Peckwood    |
    +-------+---------------------+
    |5      |Red Heron            |
    +-------+---------------------+
    |6      |Red Heron            |
    +-------+---------------------+

Actually, I expect any row which share a similar string to be chosen (in this case it's Peckwood. So it should have only 2 groups - Peckwood and Heron.

Is it possible to do so? And how to adapt mysql code to achieve it?

Regards.


Solution

  • Try this

    SELECT SUBSTRING_INDEX(name,' ',-1),count(*)
    FROM birds_name
    GROUP BY SUBSTRING_INDEX(name,' ',-1) HAVING count(*)>0;
    

    Manual for SUBSTRING_INDEX function in mysql.