I have been trying to scrape tweets from Twitter using Selenium. I can get the HTML that I want and print it, but I am having trouble getting it into a form that I can use for a DataFrame.
Here is my code:
import time
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
url = 'https://twitter.com/search?f=tweets&q=cuomosmta%20since%3A2016-08-22%20until%3A2018-08-22'
browser.get(url)
time.sleep(1)
tweet_dict = {}
tweets = browser.find_elements_by_class_name('tweet-text')
for tweet in tweets:
    print(tweet.text)
    tweet_dict['tweet'] = tweet.text
If you run the code, you will see that it prints each individual tweet. I did this to ensure that the code was working.
But for some reason, when I check my dictionary, my output from:
tweet_dict['tweet']
is:
'Ugh, Cuomo and #CuomosMTA are terrible, just terrible.'
The output above is also the last tweet on the page that I am trying to scrape.
I have tried this method multiple ways and even tried BeautifulSoup, but for some reason I keep getting the same result.
I don't understand why I am able to print all of the tweets but not add them all to the dictionary.
I am a beginner and am probably missing something very obvious so any help would be appreciated.
If possible, I would like to keep using only Selenium, since grabbing the exact timestamp is easier with it than with BeautifulSoup.
Thank you!
A dictionary can contain only unique keys, so instead of appending each tweet in the loop, you are overwriting the value stored under the same key 'tweet' on every iteration. That is why only the last tweet remains. You can try the solution below:
for tweet in range(len(tweets)):
    print(tweets[tweet].text)
    tweet_dict['tweet_%s' % tweet] = tweets[tweet].text
The resulting dictionary should look like this:
{'tweet_0': 'first tweet content', 'tweet_1': 'second tweet content', ...}
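Since the end goal is a DataFrame, here is a minimal sketch comparing the three ways of storing the tweets. It does not call Selenium: `tweet_texts` is a hypothetical list standing in for the `tweet.text` values, and the names `overwritten`, `keyed`, and `records` are made up for illustration.

```python
# Hypothetical stand-ins for the `tweet.text` values returned by Selenium.
tweet_texts = [
    "first tweet content",
    "second tweet content",
    "third tweet content",
]

# Reusing the same key on every iteration keeps only the last value.
overwritten = {}
for text in tweet_texts:
    overwritten["tweet"] = text

# Giving each tweet its own key keeps all of them.
keyed = {"tweet_%s" % i: text for i, text in enumerate(tweet_texts)}

# A list of one-key dicts (one record per tweet) also keeps all of them.
records = [{"tweet": text} for text in tweet_texts]

print(overwritten)
print(keyed)
print(records)
```

If a single 'tweet' column is what you want, a list of records like `records` may be more convenient than numbered keys, because it can be passed directly to `pd.DataFrame(records)` to get one row per tweet.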