Tags: python, selenium-webdriver, twitter

Getting Tweet Author's Handle using Selenium


The Premise

I am working on a project in which, for a specific keyword, I want to get the 10 most recent tweets containing that keyword.

I have tried using Tweepy, but it isn't working because I don't have elevated access to the Twitter API. Other scrapers like SNScrape and Twint are also not working and are giving timeouts.

The Problem

I have resorted to using Selenium to extract the required data. However, I am only able to extract the "Tweet Text" and the "Time Stamp". I want the Twitter handle of the person, but that isn't working.

The Code

Here is the complete code:

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from time import sleep
import pandas as pd

binary = FirefoxBinary('C:/Program Files/Mozilla Firefox/firefox.exe')
driver = webdriver.Firefox(executable_path='C:/Users/amrit/Downloads/geckodriver.exe', firefox_binary=binary)
driver.get("https://twitter.com/login")

sleep(3)
username = driver.find_element(By.XPATH,"//input[@name='text']")

username.send_keys("XYZ")
next_button = driver.find_element(By.XPATH,"//span[contains(text(),'Next')]")
next_button.click()


sleep(3)
password = driver.find_element(By.XPATH,"//input[@name='password']")


password.send_keys('XYZ')
log_in = driver.find_element(By.XPATH,"//span[contains(text(),'Log in')]")
log_in.click()


sleep(3)
search_box = driver.find_element(By.XPATH,"//input[@data-testid='SearchBox_Search_Input']")


search_box.send_keys("bjp")
search_box.send_keys(Keys.ENTER)

sleep(3)
latest = driver.find_element(By.XPATH,"//a[@href='/search?q=bjp&src=typed_query&f=live']")
latest.click()


UserTags=[]
TimeStamps=[]
Tweets=[]
Replys=[]
reTweets=[]
Likes=[]


articles = driver.find_elements(By.XPATH,"//article[@data-testid='tweet']")
while len(Tweets) < 10:
    for article in articles:
        try:
            UserTag = article.find_element(By.XPATH,".//div[@data-testid='User-Names']").text
            UserTags.append(UserTag)
        except:
            UserTags.append('')
        
        try:
            TimeStamp = article.find_element(By.XPATH,".//time").get_attribute('datetime')
            TimeStamps.append(TimeStamp)
        except:
            TimeStamps.append('')
        
        try:
            Tweet = article.find_element(By.XPATH,".//div[@data-testid='tweetText']").text
            Tweets.append(Tweet)
        except:
            Tweets.append('')
        
        try:
            Reply = article.find_element(By.XPATH,".//div[@data-testid='reply']").text
            Replys.append(Reply)
        except:
            Replys.append('')
        
        try:
            reTweet = article.find_element(By.XPATH,".//div[@data-testid='retweet']").text
            reTweets.append(reTweet)
        except:
            reTweets.append('')
        
        try:
            Like = article.find_element(By.XPATH,".//div[@data-testid='like']").text
            Likes.append(Like)
        except:
            Likes.append('')
    

    driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
    sleep(3)
    

    if len(Tweets) >= 10:
        break
    

    articles = driver.find_elements(By.XPATH,"//article[@data-testid='tweet']")
    Tweets = list(set(Tweets))


df = pd.DataFrame(zip(UserTags,TimeStamps,Tweets,Replys,reTweets,Likes)
                  ,columns=['UserTags','TimeStamps','Tweets','Replys','reTweets','Likes'])


df.to_excel(r"tweets.xlsx",index=False)
import os
os.system('start "excel" "tweets.xlsx"')

Please suggest any code modifications that would help me achieve my objective.

Thanks!

I have tried to be more specific when defining the XPath, but I have not been able to make any progress.

I am pretty new to Selenium, and I am mostly referring to documentation and tutorials to get this done, along with some help from generative AI.


Solution

  • I'm a bit confused about what issue/error you're running into when you go to search for tags. From what it appears, you're unable to extract anything other than the timestamp and the tweet text.

    The first thing I would fix is the first half of your code. You don't need to click on the search bar and type into it to search for something on Twitter. Instead, you can use the URL to search for whatever keyword you want.

    For example, the URL "https://twitter.com/search?q=dog&src=typed_query" does the same thing as the steps you scripted. Doing it this way also saves a lot of time, because you don't have to load the page and wait for elements to appear at each step.
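    If the keyword contains spaces or special characters, it needs to be URL-encoded first. A small sketch of building such a search URL (the parameter names `q`, `src`, and `f=live` are taken from the URLs shown in this thread; `build_search_url` is just an illustrative helper):

```python
from urllib.parse import urlencode

def build_search_url(keyword: str, latest: bool = False) -> str:
    """Build a Twitter search URL; f=live selects the 'Latest' tab."""
    params = {"q": keyword, "src": "typed_query"}
    if latest:
        params["f"] = "live"
    # urlencode percent-encodes the keyword (spaces become '+')
    return "https://twitter.com/search?" + urlencode(params)

# build_search_url("dog") -> "https://twitter.com/search?q=dog&src=typed_query"
```

    You can then pass the result straight to `driver.get(...)`.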

    There is one caveat to this: you have to be logged in to see tweets (most of the time). Since signing in grants a token to your browser that keeps you logged in between sessions, you can sign in once at the start with your credentials, then switch tabs to start your searches. To swap tabs, use:

    from selenium.webdriver.common.window import WindowTypes
    
    driver.switch_to.new_window(WindowTypes.TAB)
    driver.get('https://twitter.com/search?q=dog&src=typed_query')
    

    Now with that, you can search multiple times in the same driver instance without having to close it out or make a new window.

    Now, an answer to your problem (hopefully :). I've run into this before, where elements don't have many unique tags to identify them, meaning you have to use an XPath to find them.

    Some logic (just to show what I did) that can work here:

    for i in range(1, 10):
        tweet = driver.find_element(By.XPATH, f'//*[@id="react-root"]/div/div/div[2]/main/div/div/div/div[1]/div/div[3]/div/section/div/div/div[{i}]')
    

    Notice that the last div index is the variable i, which increases by one on each pass of the for loop, so the XPath points at the next tweet each time.

    Now with that, we can use the XPATH to find all the elements we want:

    for i in range(1, 10):
        tweet_handle = driver.find_element(By.XPATH, f'/html/body/div[1]/div/div/div[2]/main/div/div/div/div[1]/div/div[3]/div/section/div/div/div[{i}]/div/div/article/div/div/div[2]/div[2]/div[1]/div/div[1]/div/div/div[1]/div/a/div/div[1]/span/span')
    

    You can implement that logic to find all the other info that you need. If you're not already, using the dev tools in Chrome/Firefox/really any modern browser makes it super easy to find the XPaths of elements.
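    On the original question of getting the handle: the text of the "User-Names" block you already extract usually renders one field per line (display name, then @handle, then a separator and timestamp). Assuming that layout holds, the handle can be pulled straight out of that `.text` instead of hunting for a deeper XPath (`extract_handle` is an illustrative helper, not part of Selenium):

```python
def extract_handle(user_names_text: str) -> str:
    # The block's .text typically looks like "Display Name\n@handle\n·\n2h".
    # Return the first line that starts with "@", or "" if none is found.
    for line in user_names_text.splitlines():
        line = line.strip()
        if line.startswith("@"):
            return line
    return ""
```

    So in the question's loop, `extract_handle(UserTag)` would give just the "@..." part of the text that's already being collected.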

    TLDR: Use the URL to search for tweets, and use the XPaths of the tweet elements to find the data. When implementing an XPath, make the last div index the loop variable i so it moves between tweets.
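    One last aside on the collection loop in the question: calling `list(set(Tweets))` deduplicates only the tweet-text list, so it drifts out of alignment with `UserTags`, `TimeStamps`, and the rest. Collecting each tweet as a single row (e.g. a dict) and deduplicating whole rows keeps the columns together; a sketch (`dedupe_rows` is an illustrative helper):

```python
def dedupe_rows(rows):
    # Keep the first row seen for each tweet text, preserving order,
    # so every column stays aligned with the others.
    seen = set()
    unique = []
    for row in rows:
        key = row.get("Tweet", "")
        if key and key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

    A list of such dicts can be passed directly to `pd.DataFrame(...)` in place of the six parallel lists.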