Search code examples
mysqlsqllevenshtein-distance

How to count the number of duplicate records in a database?


Consider the following "tweets" table

tweet_id  user_id  text
-----------------------------------------
1         1        look at my tweet
2         1        look at my tweet
3         1        a very different tweet
4         1        look at my tweet
5         1        look at my tweets
6         2        a cool tweet
7         2        this is my tweet
8         1        hello
9         1        hello

For each user, I want to count the number of duplicate tweets. In the example above, user_id 1 has a total of 5 tweets, of which 2 are unique (tweet_id 3 and 5) and 3 are duplicate (1, 2 and 4). So the outcome of the query for user 1 should be "3".

[EDIT] Look at user_id 1. The tweet "look at my tweet" appears 3 times, the tweet "hello" 2 times. The total number of duplicate tweets is then 3 + 2 = 5.


Solution

  • For the first part, you can use the following query

    select user_id, sum(count)
    from
    (
    select user_id, text, count(tweet_id) count
    from tweets 
    group by 
    user_id, text
    having count(tweet_id) > 1
    ) t
    group by user_id
    

    The inner query finds all users and tweets that have occured more than once. The outer query adds up the duplicate values for each user