python ajax web-scraping twitter python-requests-html

Twitter scraping using Python


I've been working on a project to reverse-engineer Twitter's app in order to scrape public posts using an unofficial API, with Python. (I want to create an "alternative" app: simply a localhost site that can search for a user and get their posts.)

I've been searching and reading everything related to REST, AJAX, and the python modules requests, requests-html, BeautifulSoup, and more.

Looking at Twitter in the devtools (for example on Marvel's profile page), I can see that the only relevant requests being sent (by POST and GET) are client_event.json and UserTweets?variables=... . I identified these as the relevant requests by clearing the network tab and recording only while I scrolled down and loaded new tweets: they were the only requests that came up which weren't random videos (I filtered the list using -video -init -csp_report -config -ondemand -like -pageview -recommendations -prefetch -jot -key_live_kn -svg -jpg -jpeg -png -ico -analytics -loader -sharedCore -Hebrew).

I am new to this field, so I am probably doing something wrong. I can see the response I'm looking for on UserTweets (a beautiful JSON with all the data I need), but no matter how much I try, I am unable to access it.

I tried different modules and different headers, and I get nothing. I DON'T want to use Selenium since it's tiresome, and I know where the data I need is stored.

I've been trying to send a GET request to: https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D
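For readers who find the percent-encoded query hard to follow, here is a minimal sketch that decodes the variables parameter of the URL above into a plain dictionary and re-encodes it. It uses only the standard library; the names QUERY, variables and rebuilt are illustrative and not part of the original post.

```python
import json
from urllib.parse import parse_qs, urlencode

# Query string copied verbatim from the URL in the question.
QUERY = (
    "variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C"
    "%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C"
    "%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C"
    "%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C"
    "%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C"
    "%22withNonLegacyCard%22%3Atrue%7D"
)

# parse_qs percent-decodes the value; the result is a JSON document.
variables = json.loads(parse_qs(QUERY)["variables"][0])

# Change a field (here: drop the pagination cursor) and encode it again.
first_page = {k: v for k, v in variables.items() if k != "cursor"}
rebuilt = urlencode({"variables": json.dumps(first_page, separators=(",", ":"))})

print(variables["userId"], variables["count"])
print(rebuilt)
```

Seeing the parameters as a dictionary makes it easier to change the user ID, the count, or the cursor without editing the encoded string by hand.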

by doing:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

# Create the session object that the request below relies on
session = HTMLSession()
response = session.get('https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D')
response.html.render()
s = BeautifulSoup(response.html.html, 'lxml')

but I get back an HTML page that either says Chromium is unsupported, or is just a static page without the JavaScript updating the DOM.

All help appreciated.

Thank you

P.S I've posted the same question on reverseengineering.stackexchange, just to be safe (overflow has more appropriate tags :-))


Solution

  • Before you dive into the actual code, I would first build the correct request to Twitter. I would use a third-party tool focused on REST and APIs, such as Postman, to build and test the required request, and only then write the actual code.

    From your question it seems that you'll be using one of Twitter's open APIs, which means you'll only need to send x-guest-token and basic Bearer authorization in your request headers.

    • The Bearer is static - you can just browse to twitter and copy/paste it from the dev tools network monitor.
    • To get the x-guest-token you'll need something dynamic, because the token expires. I would suggest sending a curl request to Twitter, parsing the token from the response, and putting it in your header before sending the request. You can see something very similar in: Python Downloading twitter video using python (without using twitter api) .

    After you have both of the above, build the required GET request in Postman and test whether you get back the correct response. Only after you have everything working in Postman should you write the same in Python, or any other language**

    **You can use Postman snippets, which automatically generate the code needed in many programming languages.
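Once the request works in Postman, the steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the gt=<digits> pattern used to find the guest token is a guess about the page format, the reduced variables set (userId and count only) may not be enough for the server, and all function names are hypothetical. The endpoint path is the one copied from the question and may have changed.

```python
import json
import re
import urllib.request
from urllib.parse import urlencode

# Endpoint copied from the question; the query ID in the path may change.
USER_TWEETS_URL = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets"


def extract_guest_token(page_text):
    """Return the guest token found in a fetched page, or None.

    ASSUMPTION: the token appears as 'gt=<digits>' in the page text.
    """
    match = re.search(r"gt=(\d+)", page_text)
    return match.group(1) if match else None


def build_headers(bearer_token, guest_token):
    """Build the two headers the answer mentions."""
    return {
        "authorization": f"Bearer {bearer_token}",
        "x-guest-token": guest_token,
    }


def build_url(user_id, count=20):
    """Encode the request variables as JSON in the query string.

    ASSUMPTION: only userId and count are sent; the real request in the
    question carries more flags, which the server may require.
    """
    variables = {"userId": user_id, "count": count}
    query = urlencode({"variables": json.dumps(variables, separators=(",", ":"))})
    return USER_TWEETS_URL + "?" + query


def fetch_user_tweets(user_id, bearer_token, guest_token):
    """Send the GET request and return the parsed JSON (needs network access)."""
    request = urllib.request.Request(
        build_url(user_id), headers=build_headers(bearer_token, guest_token)
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

The Bearer value is passed in as a parameter because, as noted above, it is copied by hand from the dev tools; fetch_user_tweets is defined but not called here, since it depends on live tokens.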