Tags: python, selenium, web-scraping, twitter, beautifulsoup

Twitter scraping of older tweets


I am working on a project that needs tweets from Twitter. I used the Twitter API, but it only returns tweets from the last 7-9 days, and I also need tweets that are a few months old. So I decided to scrape Twitter with BeautifulSoup and later Selenium, but when I parse the page I don't get the elements, only the view-source of the entire webpage. Please help!

import requests
from bs4 import BeautifulSoup
f=requests.get("https://twitter.com/search?q=%23......%20until%3A2020-02-07%20since%3A2020-01-01&src=typed_query").text
soup = BeautifulSoup(f,'html.parser')

print(soup)

name = soup.find_all('span', class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")

print(name)

This is the output from printing soup. I don't know how to describe it, but it is the view-source rather than the actual HTML:

{"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},t.t=function(e,n){if(1&n&&(e=t(e)),8&n)return e;if(4&n&&"object"==typeof e&&e&&e.__esModule)return e;var d=Object.create(null);if(t.r(d),Object.defineProperty(d,"default",{enumerable:!0,value:e}),2&n&&"string"!=typeof e)for(var o in e)t.d(d,o,function(n){return e[n]}.bind(null,o));return d},t.n=function(e){var n=e&&e.__esModule?function(){return e.default}:function(){return e};return t.d(n,"a",n),n},t.o=function(e,n){return Object.prototype.hasOwnProperty.call(e,n)},t.p="https://abs.twimg.com/responsive-web/web/",t.oe=function(e){throw e};var i=window.webpackJsonp=window.webpackJsonp||[],c=i.push.bind(i);i.push=n,i=i.slice();for(var l=0;l<i.length;l++)n(i[l]);var u=c;d()}([]),window.__SCRIPTS_LOADED__.runtime=!0;
//# sourceMappingURL=runtime.cc3200a4.js.map

The Selenium output is the same:

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
PATH = "C:\\Program Files\\chromedriver.exe"
driver = webdriver.Chrome(PATH) 
driver.get("https://twitter.com")

email = driver.find_element_by_name('session[username_or_email]')
password = driver.find_element_by_name('session[password]')

email.send_keys('......')
password.send_keys("......")
password.send_keys(Keys.RETURN)
time.sleep(1)

driver.get('https://twitter.com/search?q=%23....%20until%3A2020-02-07%20since%3A2020-01-01&src=typed_query')
time.sleep(1)

print(driver.page_source)

Solution

  • GetOldTweets3 lets you extract historical tweets and filter them by several criteria, e.g. time frame, location, handle, or search query, without needing an API key.

    For example:

      import GetOldTweets3 as got

      # Tweet params
      search_term = 'china trade war'
      start_date = '2017-01-01'
      end_date = '2020-01-01'

      # Define historical tweets criteria (parentheses avoid a dangling backslash)
      tweet_criteria = (got.manager.TweetCriteria()
                        .setUsername('reuters')
                        .setQuerySearch(search_term)
                        .setSince(start_date)
                        .setUntil(end_date))

      # Return tweets based on tweet criteria
      tweets = got.manager.TweetManager.getTweets(tweet_criteria)

      # getTweets returns a list, so read the text from each tweet
      tweet_texts = [tweet.text for tweet in tweets]
     
    

    Note that you can access further tweet attributes, such as hashtags and retweets, on each tweet object, for example:

    other_tweet_attributes = [[tweet.username, tweet.hashtags] for tweet in tweets]
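
If you want more than a couple of attributes per tweet, it may be easier to collect them as one dict per tweet. The following is only a sketch: `FakeTweet` and `tweets_to_rows` are hypothetical names introduced here, and `FakeTweet` is a stand-in for the tweet objects so the snippet runs without hitting Twitter. It assumes the real objects expose `username`, `text`, `hashtags` and `retweets` as attributes; any that are missing come back as `None`.

```python
from collections import namedtuple

# Hypothetical stand-in for the tweet objects returned by getTweets(),
# used only so this sketch runs without network access.
FakeTweet = namedtuple("FakeTweet", ["username", "text", "hashtags", "retweets"])


def tweets_to_rows(tweets, fields=("username", "text", "hashtags", "retweets")):
    """Turn tweet-like objects into a list of dicts, one dict per tweet.

    Attributes that a tweet object does not have are stored as None.
    """
    return [
        {field: getattr(tweet, field, None) for field in fields}
        for tweet in tweets
    ]


# Made-up sample data, not real tweets.
sample = [
    FakeTweet("reuters", "Example text about trade talks", "#trade", 12),
    FakeTweet("reuters", "Another example", "", 3),
]

rows = tweets_to_rows(sample)
for row in rows:
    print(row["username"], row["retweets"], row["text"])
```

With the real library you would pass the `tweets` list from above instead of `sample`. A list of dicts like this can be written out with `csv.DictWriter` or loaded into a DataFrame if you use pandas.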