The Twitter API is now paid, so I need to write a parser for tweet pages instead. I am going through a SOCKS5 proxy.
My first step was to fetch the tweet page directly through the SOCKS5 proxy. I got a 302 response and an endless redirect loop.
Then I tried adding cookies and got a "Please enable JS" page.
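For reference, a minimal sketch of that direct request, assuming .NET 6+ (which is what adds socks5:// support to WebProxy); host:port and the cookie values are placeholders:

using System.Net;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    Proxy = new WebProxy("socks5://host:port"),
    UseProxy = true,
    UseCookies = false,        // so the manual Cookie header is actually sent
    AllowAutoRedirect = false  // surface the 302 instead of redirecting forever
};
using var client = new HttpClient(handler);

var request = new HttpRequestMessage(HttpMethod.Get,
    "https://twitter.com/ElonMuskAOC/status/1677171220184469505");
request.Headers.Add("Cookie", "auth_token=...; ct0=..."); // placeholder values

var response = await client.SendAsync(request);
Console.WriteLine((int)response.StatusCode); // 302 without cookies; with them, the body is the "Please enable JS" page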
So I decided to use Selenium to get the page instead. When I load the page without headless=new there is no problem, but as soon as I add that argument, the "Please enable JS" page reappears.
What I've tried:
- different user agents (see the sketch after this list)
- different Selenium libraries
- explicitly setting the path to the ChromeDriver binary (driver v114.0.5735.90 with Google Chrome v114.0.5735.199)
- a different browser (Edge)
JavaScript was enabled in every case.
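For example, this is roughly how I overrode the user agent (the exact UA string here is just one of the desktop Chrome strings I tried, nothing special):

using OpenQA.Selenium.Chrome;

// Sketch of the user-agent override attempt; the UA string is an example.
var uaOptions = new ChromeOptions();
uaOptions.AddArgument("--headless=new");
uaOptions.AddArgument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36");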
I use the latest version of the Selenium library; the language is C#.
I created a simple console app for easy debugging. The basic code below should work, I believe:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Route Chrome through the SOCKS5 proxy (host:port is a placeholder).
Proxy proxy = new Proxy();
proxy.Kind = ProxyKind.Manual;
proxy.SocksVersion = 5;
proxy.SocksProxy = "host:port";

var options = new ChromeOptions();
options.AddArgument("--headless=new"); // removing this argument makes the page load fine
options.Proxy = proxy;

string pageSource = "";
using (var driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("https://twitter.com/ElonMuskAOC/status/1677171220184469505");
    pageSource = driver.PageSource;
}

Console.WriteLine(pageSource); // with headless=new this contains the noscript fallback
Console.ReadLine();
In response to the comment that everything is fine and JS just needs time to execute: the noscript tag is always on the page and does not need additional time to appear.
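To rule out timing, I also waited explicitly before reading PageSource. A sketch (WebDriverWait comes from the Selenium.Support NuGet package; assume this runs inside the using block above):

using OpenQA.Selenium.Support.UI;

// Wait until the document is fully loaded, then check whether the
// noscript fallback is still all we got.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
wait.Until(d => ((IJavaScriptExecutor)d)
    .ExecuteScript("return document.readyState").Equals("complete"));

// The substring is from Twitter's fallback page; adjust if yours differs.
bool stillBlocked = driver.PageSource.Contains("JavaScript is not available");
Console.WriteLine(stillBlocked); // True right away, and it stays True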