Getting duplicate tweets using tweepy to pull from user timelines


I'm trying to pull tweets from a list of accounts using tweepy. I am able to get the tweets, but I'm getting huge numbers of duplicate tweets from a single account. In some cases, I've pulled 400 tweets and about half of them were duplicates.

I've looked at the accounts on Twitter itself and confirmed that these accounts are not just tweeting the same thing over and over. I've also confirmed that they don't have a hundred-plus retweets that might account for this. When I look at the actual tweet objects for the duplicates, everything is exactly the same: the tweet ID, the created_at time, the retweet counts, the @mentions, and the hashtags. I'm thinking it might be something in my loop, but everything I try yields the same result.

Any ideas? I don't want to just deduplicate, because then I'll have substantially fewer tweets from some accounts.

# A list of the accounts I want tweets from
friendslist = ["SomeAccount", "SomeOtherAccount"] 

# Where I store the tweet objects
friendstweets = []

# Loop that cycles through my list of accounts to add tweets to friendstweets
for f in friendslist:
    num_needed = 400 # The number of tweets I want from each account
    temp_list = []
    last_id = -1 # id of last tweet seen
    while len(temp_list) < num_needed:
        try:
            new_tweets = api.user_timeline(screen_name = f, count = 400, include_rts = True)
        except tweepy.TweepError as e:
            print("Error", e)
            break
        except StopIteration:
            break
        else:
            if not new_tweets:
                print("Could not find any more tweets!")
                break
        friendstweets.extend(new_tweets)
        temp_list.extend(new_tweets)
        last_id = new_tweets[-1].id
    print('Friend '+f+' complete.')

Solution

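  • Your problem lies in this line: while len(temp_list) < num_needed:. The request inside the loop never changes: last_id is recorded but never passed back to api.user_timeline, so every iteration fetches the same newest page of tweets and appends it again until temp_list holds at least num_needed items. That is why the duplicates have identical IDs and timestamps.

    The fix I would suggest is to remove that while loop and change the count of fetched tweets from 400 to num_needed: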

    new_tweets = api.user_timeline(screen_name = f, count = num_needed, include_rts = True)
    

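    Note that Twitter's user_timeline endpoint returns at most 200 tweets per request, so a single call will come up short when num_needed is above 200. In that case, keep the loop and pass max_id = last_id - 1 on every request after the first, so each page starts just below the oldest tweet already collected.

    Hope it works as intended then.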
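
If more than one page of tweets is needed, the loop from the question can be kept as long as each request is told where the previous one stopped. The following is a minimal sketch of that idea, not a definitive implementation. It assumes api is an authenticated tweepy.API object like the one in the question; fetch_timeline and page_size are hypothetical names introduced here for illustration, while max_id is the existing user_timeline parameter that limits results to tweets with an ID less than or equal to the given value.

```python
def fetch_timeline(api, screen_name, num_needed, page_size=200):
    """Collect up to num_needed tweets for one account, newest first.

    Hypothetical helper: `api` is assumed to expose user_timeline() the
    way tweepy.API does. page_size defaults to 200 on the assumption
    that this is the most a single request will return.
    """
    tweets = []
    max_id = None  # no upper bound on the first request

    while len(tweets) < num_needed:
        kwargs = {
            "screen_name": screen_name,
            # Never ask for more than is still missing.
            "count": min(page_size, num_needed - len(tweets)),
            "include_rts": True,
        }
        if max_id is not None:
            kwargs["max_id"] = max_id

        page = api.user_timeline(**kwargs)
        if not page:
            break  # the timeline has no older tweets to return

        tweets.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen
        # to avoid fetching it a second time.
        max_id = page[-1].id - 1

    return tweets
```

In the question's outer loop this could be called as friendstweets.extend(fetch_timeline(api, f, 400)), with the try/except around the call kept as before. Because every page starts below the previous one, no deduplication step should be needed afterwards.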