Search code examples
mysqlsqlcoldfusionduplicates

Find duplicate emails in MySql and delete certain ones based on conditions


I see a lot of ways to find duplicate records and delete them but I could not find the one I am looking for i.e. to delete them based on certain condition.

The first SQL code below taken from another post works very well in locating duplicates. However, I am having problem withe the second part.

<cfquery datasource="mydatabase" name="duplist">
SELECT  a.*, b.totalCount AS Duplicate
FROM    mytable a
        INNER JOIN
        (
            SELECT  email, COUNT(*) totalCount
            FROM    mytable
            GROUP   BY email
            HAVING  COUNT(*) > 1
        ) b ON a.email = b.email
</cfquery>

<cfoutput query="duplist">
    <CFQUERY DATASOURCE="mydatabase" name="dlist">
    SELECT * FROM mytable
    WHERE userid = '#userid#'
    AND activedate is null
    </CFQUERY>
</cfoutput> 

<cfoutput>
Total To Be Deleted: #dlist.recordcount#
</cfoutput>

I am trying use SELECT (before replace it with DELETE) only those duplicates user emails that do not have an "activedate" based on the output query "duplist" with duplicated email records. I am only getting one record (in #dlist.recordcount#) which is the last record! Isn't the query supposed to loop through them all?

EDITED:

After some feedback and checking the recordcount wasn't counting correctly because it is outside the loop. It will work in retrieving those records without startdate. However, now the real problem is in the following scenario:

ID   USERID   EMAIL          STARTDATE
1    user1    test@test.com 
2    user2    test@test.com  11/01/2014
3    user3    test@test.com 
4    user4    test@test.com  11/02/2014 
5    user5    me@mydomain.com
6    user6    me@mydomain.com

Running through the above code will list these 6 records of duplicate emails. The second query will select(delete) those without startdate i.e. Record 1, 3, 5 and 6. Two problems:

  1. Record 1 and 3 will be deleted, but 2 and 4 are still duplicates
  2. Record 5 and 6 will be deleted and I am left with no user with the email. I need to keep one of them.

How can the codes be modified to deal with the scenarios? I want to keep record #4 and #6 because 4 is newer than 2 and 6 is newer than 6.

4    user4    test@test.com  11/02/2014 
6    user6    me@mydomain.com

Solution

  • You can use the Rank() function to rank the data and delete the rows which are not ranked 1.

    Oracle equivalent of the Rank() query to select the #4 and #6 data:

    SELECT * FROM
    (
    SELECT USERNAME, EMAIL, START_DATE, 
            RANK() OVER (PARTITION BY EMAIL ORDER BY START_DATE DESC NULLS LAST, ROWNUM DESC) AS RANK
    FROM TEMP_SOL
    )
    WHERE RANK = 1
    

    Mysql equivalent can be derived using this link Rank function in MySQL with Order By clause

    Edit : A non-rank solution can be something like:

    First get all the duplicate emails

    <cfquery name="dupEmail" datasource="XXX">
    SELECT EMAIL
    FROM TEMP_SOL
    GROUP BY EMAIL
    HAVING COUNT(*) > 1
    </cfquery>
    

    Loop through dupEmail and

    1. select the username which doesn't have to be deleted.
    2. Select/Delete the usernames not equal to the above username having the same emailid.

      <cfloop query="dupEmail">
         <cfquery name="UserToRetain" datasource="XXX">
             SELECT USERNAME FROM TEMP_SOL
             WHERE EMAIL = '#dupEmail.Email#'
             ORDER BY START_DATE DESC NULLS LAST, ROWNUM DESC
             LIMIT 1 
        </cfquery>
      
        <cfquery name="DeleteUsers" datasource="XXX">
               SELECT * FROM TEMP_SOL
               WHERE USERNAME <> '#UserToRetain.USERNAME#' AND EMAIL='#dupEmail.Email#'
        </cfquery>
      
      </cfloop>
      

    Note: The code above hasn't been tested for and the queries may not run on mysql.