Search code examples
pythonsqlsqliteduplicatesyahoo-finance

Removing duplicate key but unique rows with priority using SQLite3 in Python


I'm quite new to Python and I'm trying to get the following done:

  1. Pull RSS feed from Yahoo Finance News API ->YDN
  2. Put the Title, Link,PubDate into a Database table
  3. Store only unique Titles and the latest PubDate

1 and 2 I got through no problem but when I try to do 3, I ran into some trouble. The following is how the table looks like:

enter image description here

As you can see, I have duplicated Titles with different PubDate. What I've been trying to do is to follow the steps ilustrated from Microsoft Kb -> Support Microsoft.

I was able to get through steps 1 - 3 which does:

  1. Selecting all Titles with counts > 1 into a table called holdkey
  2. Selecting all distinct records with Titles equal to Titles in the holdkey table

my code is slightly different from what was shown in the MSFT link:

import sqlite3

cur = db.cursor()
cur.execute('''create table holdkey as 
            select Title,count(*) 
            from bar group by Title
            having count(*)>1''')
cur.execute('''create table holdup as 
            select distinct bar.* 
            from bar, holdkey 
            where bar.Title = holdkey.Title''')

What I need help on is that in my holdup table (as seen in the screenshot), I still get duplicate Titles because the pubDate is different and that it is known as distinct by SQLite. Is there a way to delete the duplicates based on the PubDate?

I would ideally just keep the latest one, but getting rid of the duplicate is more important.

Let me know if more clarification is required. Thanks


Solution

  • Try

    SELECT b.symbol, b.title, b.link, b.pubdate
      FROM bar b JOIN
    (
      SELECT title, MAX(pubdate) pubdate
        FROM bar
       GROUP BY title
    ) q ON b.title = q.title 
       AND b.pubdate = q.pubdate
    

    Here is SQLFiddle demo