I see a lot of ways to find duplicate records and delete them but I could not find the one I am looking for i.e. to delete them based on certain condition.
The first SQL code below taken from another post works very well in locating duplicates. However, I am having problem withe the second part.
<cfquery datasource="mydatabase" name="duplist">
SELECT a.*, b.totalCount AS Duplicate
FROM mytable a
INNER JOIN
(
SELECT email, COUNT(*) totalCount
FROM mytable
GROUP BY email
HAVING COUNT(*) > 1
) b ON a.email = b.email
</cfquery>
<cfoutput query="duplist">
<CFQUERY DATASOURCE="mydatabase" name="dlist">
SELECT * FROM mytable
WHERE userid = '#userid#'
AND activedate is null
</CFQUERY>
</cfoutput>
<cfoutput>
Total To Be Deleted: #dlist.recordcount#
</cfoutput>
I am trying use SELECT (before replace it with DELETE) only those duplicates user emails that do not have an "activedate" based on the output query "duplist" with duplicated email records. I am only getting one record (in #dlist.recordcount#) which is the last record! Isn't the query supposed to loop through them all?
EDITED:
After some feedback and checking the recordcount wasn't counting correctly because it is outside the loop. It will work in retrieving those records without startdate. However, now the real problem is in the following scenario:
ID USERID EMAIL STARTDATE
1 user1 test@test.com
2 user2 test@test.com 11/01/2014
3 user3 test@test.com
4 user4 test@test.com 11/02/2014
5 user5 me@mydomain.com
6 user6 me@mydomain.com
Running through the above code will list these 6 records of duplicate emails. The second query will select(delete) those without startdate i.e. Record 1, 3, 5 and 6. Two problems:
How can the codes be modified to deal with the scenarios? I want to keep record #4 and #6 because 4 is newer than 2 and 6 is newer than 6.
4 user4 test@test.com 11/02/2014
6 user6 me@mydomain.com
You can use the Rank() function to rank the data and delete the rows which are not ranked 1.
Oracle equivalent of the Rank() query to select the #4 and #6 data:
SELECT * FROM
(
SELECT USERNAME, EMAIL, START_DATE,
RANK() OVER (PARTITION BY EMAIL ORDER BY START_DATE DESC NULLS LAST, ROWNUM DESC) AS RANK
FROM TEMP_SOL
)
WHERE RANK = 1
Mysql equivalent can be derived using this link Rank function in MySQL with Order By clause
Edit : A non-rank solution can be something like:
First get all the duplicate emails
<cfquery name="dupEmail" datasource="XXX">
SELECT EMAIL
FROM TEMP_SOL
GROUP BY EMAIL
HAVING COUNT(*) > 1
</cfquery>
Loop through dupEmail and
Select/Delete the usernames not equal to the above username having the same emailid.
<cfloop query="dupEmail">
<cfquery name="UserToRetain" datasource="XXX">
SELECT USERNAME FROM TEMP_SOL
WHERE EMAIL = '#dupEmail.Email#'
ORDER BY START_DATE DESC NULLS LAST, ROWNUM DESC
LIMIT 1
</cfquery>
<cfquery name="DeleteUsers" datasource="XXX">
SELECT * FROM TEMP_SOL
WHERE USERNAME <> '#UserToRetain.USERNAME#' AND EMAIL='#dupEmail.Email#'
</cfquery>
</cfloop>
Note: The code above hasn't been tested for and the queries may not run on mysql.