Search code examples
sqlpostgresqlgreatest-n-per-group

How to remove duplicates in one column based on the value of other columns in psql


I have a database which is supposed to imitate a library management system. I want to write a query that, present a table that shows the top 3 borrowed books for each publisher, also showing their corresponding rank (so the book borrowed the most times from publisher X will show rank 1). I have a query that displays the information below - title of borrowed books together with their corresponding publisher, and the amount of times each book has been borrowed. As you can see; Bloomsbury (UK) is present 7 times (one for each Harry Potter book) but I want it to only display the 3 most popular Harry Potter books in regards to amount of times borrowed. I'm very thankful for any help.

                  title                   |       publisher        | times
------------------------------------------+------------------------+------
 Harry Potter and the Philosopher's Stone | Bloomsbury (UK)        |    2
 Harry Potter and the Deathly Hallows     | Bloomsbury (UK)        |    2
 Harry Potter the Goblet of Fire          | Bloomsbury (UK)        |    3
 The Fellowship of the Ring               | George Allen & Unwin   |    1
 Calculus                                 | Paerson Addison Wesley |    1
 Go Set a Watchman                        | HarperCollins          |    1
 Harry Potter the Half-Blood Prince       | Bloomsbury (UK)        |    4
 Harry Potter and the Chamber of Secrets  | Bloomsbury (UK)        |    3
 Harry Potter and Prisoner of Azkaban     | Bloomsbury (UK)        |    2
 Nineteen Eighty-Four                     | Secker & Warburg       |    1
 Harry Potter the Order of the Phoenix    | Bloomsbury (UK)        |    4
 To Kill a Mockingbird                    | J.B.Lippincott & Co    |    1

The query below will generate the view above.

SELECT title, publisher, COUNT(borrowed.resid) AS rank 
FROM borrowed 
  CROSS JOIN book 
  CROSS JOIN bookinfo 
WHERE borrowed.resid = book.resid 
  AND book.isbn = bookinfo.isbn 
  AND book.copynumber = borrowed.copynumber 
GROUP BY title, publisher;

Solution

  • Fixing the joins and adding RANK:

    select *
    from 
     (
        SELECT title, publisher, COUNT(*) AS cnt,
           -- rank the counts
           rank() over (partition by publisher order by count(*) desc) as rnk 
        FROM borrowed 
          JOIN book 
            ON borrowed.resid = book.resid 
           AND book.copynumber = borrowed.copynumber 
          JOIN bookinfo 
            ON book.isbn = bookinfo.isbn 
        GROUP BY title, publisher
     ) as dt
    where rnk <= 3
    

    You might want to switch to ROW_NUMBER (exactly 3 rows) or DENSE_RANK (3 highest counts) instead of RANK (3 rows, maybe more if row #4+ got the same count as row #3).