Tags: python, twitter, tweepy

Tweepy full-archive search, Twitter Advanced Search, and GetOldTweets3 are returning different numbers of Tweets


When using Tweepy, GetOldTweets3, and Twitter Advanced Search with the following parameters:

  • Query: "Accident"
  • Place: "Dallas, TX"
  • Since: "2018/1/1"
  • Until: "2018/1/2"

The number of Tweets returned is different for each search method. Tweepy, using full-archive search, returns 12 Tweets. GetOldTweets3 returns 22 Tweets. Twitter Advanced Search returns 3 Tweets. Is there a reason for the different numbers of Tweets?


Solution

  • Twitter's search through its website has different operators than its API.

    Searching "Accident near:Dallas,TX since:2018-01-01 until:2018-01-02" on Twitter itself returns 22 Tweets. The Top tab shows only 3 of them, yes, but you can see all of them by clicking the Latest tab. The near operator this query uses doesn't seem to be explicitly documented anywhere, so it's unclear exactly how it works. In fact, location/place doesn't even seem to be part of the Advanced Search UI anymore. Historically, it seems to have worked by searching within a radius of the specified location, defaulting to 15 miles if the within operator isn't set.
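
    To make the mapping from the question's parameters to the website query concrete, here is a minimal sketch that assembles the same search string. The function name and its arguments are hypothetical, and since the near and within operators are undocumented, the optional within term only reflects the historical behavior described above and may not work today.

```python
from datetime import date
from typing import Optional


def build_web_search_query(keyword: str, near: str, since: date, until: date,
                           within: Optional[str] = None) -> str:
    """Assemble a query string for the Twitter website's search box.

    Hypothetical helper: the near/within operators are undocumented, so this
    only mirrors the query format quoted in the answer.
    """
    parts = [keyword, f"near:{near}"]
    if within is not None:
        # For example "15mi", reportedly the historical default radius.
        parts.append(f"within:{within}")
    # since/until take ISO dates (YYYY-MM-DD) on the website.
    parts.append(f"since:{since.isoformat()}")
    parts.append(f"until:{until.isoformat()}")
    return " ".join(parts)


query = build_web_search_query("Accident", "Dallas,TX",
                               date(2018, 1, 1), date(2018, 1, 2))
print(query)  # Accident near:Dallas,TX since:2018-01-01 until:2018-01-02
```

    Pasting the printed string into the website's search box and switching to the Latest tab is what produced the 22 Tweets mentioned above.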

    The current branch/PR for Tweepy adding API.search_full_archive, which is what I assume you're using, uses the full-archive endpoint of Twitter's premium search APIs. Something like api.search_full_archive("Environment_Name", "Accident place:Dallas,TX", fromDate=201801010000, toDate=201801020000) does in fact return 12 Tweets. However, this is using the documented place premium search operator, which has specific defined behavior:

    Matches Tweets tagged with the specified location or Twitter place ID

    This means that it will only return Tweets that were tagged specifically with that location, rather than including other locations nearby within a certain radius. Oddly enough, these results actually include 2 Tweets that the website's search misses and doesn't seem to return by location search. This could be due to Twitter's search policies, but again, it's difficult to determine the exact reason since Twitter's website search isn't documented and is somewhat of a black box.
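
    The call above can be wrapped roughly as follows. This is a sketch rather than a definitive implementation: the helper names are made up for illustration, api is assumed to be an already authenticated tweepy.API instance from a Tweepy version that provides search_full_archive, and the quoting of place names that contain spaces is my reading of the premium operator documentation.

```python
from datetime import datetime


def premium_timestamp(moment: datetime) -> str:
    """Format a datetime as YYYYMMDDHHmm, the form used for fromDate/toDate."""
    return moment.strftime("%Y%m%d%H%M")


def build_place_query(keyword: str, place: str) -> str:
    """Combine a keyword with the premium `place` operator."""
    # Assumption: place names containing spaces need double quotes.
    if " " in place:
        place = f'"{place}"'
    return f"{keyword} place:{place}"


def search_by_place(api, label: str, keyword: str, place: str,
                    start: datetime, end: datetime):
    """Run a full-archive search for Tweets tagged with an exact place.

    `api` is assumed to be a tweepy.API instance; `label` is the name of the
    premium search environment (dev environment label).
    """
    return api.search_full_archive(
        label,
        build_place_query(keyword, place),
        fromDate=premium_timestamp(start),
        toDate=premium_timestamp(end),
    )
```

    With an authenticated api object, search_by_place(api, "Environment_Name", "Accident", "Dallas,TX", datetime(2018, 1, 1), datetime(2018, 1, 2)) sends the same query and date range as the call quoted above.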

    If you want to specify a set of coordinates and a radius for your search using the premium search API, you can do so with the point_radius premium search operator. Using Tweepy's API.geo_search method, which uses the Twitter API's GET geo/search endpoint, with a query for "Dallas,TX", the Place object returned for Dallas, TX specifies a centroid of [-96.7301749064317, 32.819858499999995]. There's no guarantee that these are the coordinates that Twitter's website search uses. With some testing, though, the point_radius radius that reproduces the website's results for these coordinates seems to be somewhere between 17 and 18 miles. With a radius of 17.5 miles, there are only 3 extra Tweets, all from Plano.
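
    A point_radius query for that centroid could be built along these lines. The helper name is hypothetical, the coordinates are rounded to six decimal places purely for readability, and the 25-mile ceiling is what I understand the documented limit for this operator to be, so treat that check as an assumption.

```python
from typing import Tuple

# (longitude, latitude) centroid of the Dallas, TX Place object quoted above.
DALLAS_CENTROID = (-96.7301749064317, 32.819858499999995)


def build_point_radius_query(keyword: str, centroid: Tuple[float, float],
                             radius_miles: float) -> str:
    """Combine a keyword with the premium `point_radius` operator.

    The operator takes longitude first, then latitude, then a radius with
    a unit suffix: point_radius:[lon lat radius].
    """
    # Assumption: the API rejects radii of 25 miles or more.
    if not 0 < radius_miles < 25:
        raise ValueError("radius_miles must be between 0 and 25 (exclusive)")
    longitude, latitude = centroid
    return f"{keyword} point_radius:[{longitude:.6f} {latitude:.6f} {radius_miles}mi]"


query = build_point_radius_query("Accident", DALLAS_CENTROID, 17.5)
print(query)  # Accident point_radius:[-96.730175 32.819858 17.5mi]
```

    The resulting string goes in the query argument of API.search_full_archive in place of the place-based query, with the same fromDate and toDate values.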

    GetOldTweets3 does not use Twitter's API and instead scrapes the site directly. This should not be considered reliable and is against Twitter's Terms of Service:

    scraping the Services without the prior consent of Twitter is expressly prohibited

    If you want the most accurate and well-defined results, you should use Twitter's API. This is the only valid method if you want to retrieve those results programmatically without violating Twitter's TOS. Your options for searching by location are to search for a specific location by name or Twitter place ID, by coordinates and radius, or by bounding box, using the place, point_radius, or bounding_box premium search operators, respectively. Note that for some reason, as those 2 other Tweets showed, certain Tweets might only be found by searching for a specific location rather than an area.
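
    For the bounding_box option, a query might be assembled as in the sketch below. The helper name is hypothetical, and the box coordinates are illustrative values chosen loosely around the Dallas centroid, not an official boundary. As far as I know, the operator expects west longitude, south latitude, east longitude, north latitude, and limits each side of the box to 25 miles.

```python
def build_bounding_box_query(keyword: str, west: float, south: float,
                             east: float, north: float) -> str:
    """Combine a keyword with the premium `bounding_box` operator.

    Expected order: bounding_box:[west_long south_lat east_long north_lat].
    """
    if west >= east or south >= north:
        raise ValueError("expected west < east and south < north")
    return f"{keyword} bounding_box:[{west} {south} {east} {north}]"


# Illustrative box only: roughly 17 miles on each side around central Dallas.
query = build_bounding_box_query("Accident", -96.9, 32.7, -96.6, 32.95)
print(query)  # Accident bounding_box:[-96.9 32.7 -96.6 32.95]
```

    As with the other operators, the string is passed as the query to the full-archive search, and the results may still omit Tweets that only a specific place search finds.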