Issue with scraping tweets using Python

I am trying to scrape tweets from one webpage within a certain timeframe.

To do so I am using this link which only searches within the timeframe I have specified:

https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22

This is my code:

import pandas as pd
import datetime as dt
import urllib.request
from bs4 import BeautifulSoup

url = 'https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22'
thepage = urllib.request.urlopen(url)
# Parse the HTML returned by urlopen
soup = BeautifulSoup(thepage, "html.parser")

i = 1
for tweet in soup.find_all('div', {'class': 'js-tweet-text-container'}):
    print(tweet.find('p', {'class': 'TweetTextSize'}).text.encode('UTF-8'))
    print(i)
    i += 1

The above code works when I scrape the actual Twitter page of the subwaystat user.

So I don't understand why it doesn't work for the search page, even though the HTML looks the same to me.

I am a total beginner so I'm sorry if this is a dumb question. Thank you!

Solution

Twitter has a Search API (docs: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets). Combined with an unofficial Python wrapper such as python-twitter (https://github.com/bear/python-twitter), it makes getting tweets very easy.
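As a rough sketch of the API route, something like the following could work with python-twitter. The helper name search_tweets and the credentials dictionary are my own placeholders, and you would need your own keys from a Twitter developer account. As far as I know the standard search endpoint only covers roughly the last week of tweets, so a 2016 to 2018 window may well come back empty.

```python
def search_tweets(term, since, until, credentials, count=100):
    """Return the text of tweets matching term between two YYYY-MM-DD dates.

    credentials is assumed to be a dict with the keys consumer_key,
    consumer_secret, access_token_key and access_token_secret.
    """
    # python-twitter is imported here so the module loads even if it is missing.
    import twitter

    api = twitter.Api(**credentials)
    # GetSearch wraps the search/tweets endpoint linked above.
    results = api.GetSearch(term=term, since=since, until=until, count=count)
    return [status.text for status in results]
```

You would call it as search_tweets("subwaydstats", "2016-08-22", "2018-08-22", my_credentials), where my_credentials holds your four keys.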

However, if you want to scrape the HTML, it's a lot more difficult. I was doing something similar, scraping an Angular app, where the HTML you see on the screen is actually rendered by front-end JavaScript. Requests and urllib only fetch the basic HTML the server returns; they do not run the JavaScript. That is probably why your code finds no tweets on the search page.
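A quick way to check whether this is what is happening is to look for the class name your scraper relies on in the raw HTML that urllib downloads. This is only a diagnostic sketch; the helper names download_html and contains_marker are made up for illustration.

```python
import urllib.request


def download_html(url):
    """Fetch the page the way urllib sees it: no JavaScript is executed."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def contains_marker(html, marker="js-tweet-text-container"):
    """Return True if the class name the scraper looks for is in the HTML."""
    return marker in html
```

If contains_marker(download_html(url)) is False for the search URL but True for the user page, the tweets on the search page are being added by JavaScript after the initial load.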

You could use Selenium, which lets you automate a real browser. Since it behaves as a browser, it actually runs that front-end JavaScript, so you will be able to scrape the rendered page.
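Here is a minimal sketch of the browser-automation approach, assuming Chrome and a matching driver are installed. The function names, the optional driver parameter and the 5 second pause are my own choices, not anything Twitter or Selenium prescribes. The class names are the ones from the question and may not match what the search page actually serves.

```python
import time


def fetch_rendered_html(url, driver=None, wait_seconds=5):
    """Open url in a browser and return the HTML after JavaScript has run."""
    own_driver = driver is None
    if own_driver:
        # Imported here so the function can be tried with any driver-like object.
        from selenium import webdriver
        driver = webdriver.Chrome()  # assumes Chrome and its driver are installed
    try:
        driver.get(url)
        # Crude fixed pause to let the page's scripts render; 5 seconds is a guess.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        # Only close a browser that this function opened itself.
        if own_driver:
            driver.quit()


def extract_tweet_texts(html):
    """Return the text of each tweet, using the class names from the question."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for container in soup.find_all("div", {"class": "js-tweet-text-container"}):
        paragraph = container.find("p", {"class": "TweetTextSize"})
        if paragraph is not None:
            texts.append(paragraph.get_text())
    return texts
```

With these, extract_tweet_texts(fetch_rendered_html(url)) replaces the urlopen call in the question. The search page appears to load more results as you scroll, so this will probably only capture the first batch of tweets.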

This article explains the different ways you can scrape Twitter: https://medium.com/@dawran6/twitter-scraper-tutorial-with-python-requests-beautifulsoup-and-selenium-part-2-b38d849b07fe