python selenium twitter

Grabbing data using selenium and adding it to a dictionary for use in a dataframe


I have been trying to grab tweets from Twitter using selenium. I can get the HTML I want and print it, but I am having trouble getting it into a form that is suitable for a dataframe.

Here is my code:

import time
import pandas as pd
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url = 'https://twitter.com/search?f=tweets&q=cuomosmta%20since%3A2016-08-22%20until%3A2018-08-22'

browser.get(url)
time.sleep(1)

tweet_dict = {}

tweets = browser.find_elements(By.CLASS_NAME, 'tweet-text')

for tweet in tweets:
    print(tweet.text)
    tweet_dict['tweet'] = tweet.text

If you run the code, you will see that it prints each individual tweet. I did this to ensure that the code was working.

But for some reason, when I check my dictionary, my output from:

tweet_dict['tweet']

is:

'Ugh, Cuomo and #CuomosMTA are terrible, just terrible.'

The output above is also the last tweet on the page that I am trying to scrape.

I have tried this method multiple ways and even tried BeautifulSoup, but for some reason I keep getting the same result.

I don't understand why I am able to print all of the tweets but not add them to the dictionary.

I am a beginner and am probably missing something very obvious so any help would be appreciated.

If possible, I would like to keep using only selenium, since it makes grabbing the exact timestamp easier than BeautifulSoup does.

Thank you!


Solution

  • A dictionary holds only one value per key, so assigning to tweet_dict['tweet'] on every iteration overwrites the previous value instead of appending to it. After the loop, only the last tweet remains. Give each tweet its own key instead:

    # enumerate yields the index alongside each element
    for i, tweet in enumerate(tweets):
        print(tweet.text)
        tweet_dict['tweet_%s' % i] = tweet.text
    

    The resulting dictionary should look like this:

    {'tweet_0': 'first tweet content', 'tweet_1': 'second tweet content', ...}
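
Since the end goal is a dataframe, another option is to keep a single 'tweet' key and map it to a list, which gives one column with one entry per tweet. The sketch below is only an illustration: the SimpleNamespace objects are hypothetical stand-ins for the Selenium elements (only their .text attribute is used), so it runs without a browser. It also reproduces the overwriting behaviour from the question for comparison.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Selenium WebElements; only `.text` is used here.
tweets = [
    SimpleNamespace(text='first tweet content'),
    SimpleNamespace(text='second tweet content'),
]

# Overwriting: the same key is reassigned on every pass,
# so only the last tweet survives the loop.
overwritten = {}
for tweet in tweets:
    overwritten['tweet'] = tweet.text

# Collecting: one key mapped to a list, with one entry appended per tweet.
tweet_dict = {'tweet': []}
for tweet in tweets:
    tweet_dict['tweet'].append(tweet.text)

print(overwritten)
print(tweet_dict)
```

In the real script, tweets would be the list returned by selenium. A dictionary of this shape can then be passed to pd.DataFrame(tweet_dict), which should give one row per tweet in a column named 'tweet'.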