Where is my encoding wrong? (The letter "o" appears.)
sys.setdefaultencoding('utf-8')
This statement has been removed; I use Python 3.
The tweets then come out as escaped bytes like this: \x95\x84\xeb\xb0\xb0\xea\xb3\xa0\xed\x8c\x8c'
Where is my encoding wrong?
I also find the following output hard to understand:
346 seconds: 52.25020146369934
347 seconds: 52.694828271865845
348 seconds: 52.80767774581909
349 seconds: 52.92116045951843
Only after lines like these does the data (the tweets) come out. What do these lines mean?
#py3.6
import time
from selenium import webdriver
import codecs
import sys
import importlib

importlib.reload(sys)

browser = webdriver.PhantomJS('C:\phantomjs-2.1.1-windows/bin/phantomjs')
url = u'https://twitter.com/search?f=tweets&vertical=default&q=%EB%B0%B0%EA%B3%A0%ED%8C%8C%20since%3A2017-07-19%20until%3A2017-07-20&l=ko&src=typd&lang=ko'
browser.get(url)
time.sleep(1)

body = browser.find_element_by_tag_name('body')
browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")

start = time.time()
for _ in range(5000):
    now = time.time()
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    print(str(_) + " seconds: " + str(now - start))
    time.sleep(0.1)

tweets = browser.find_elements_by_class_name('tweet-text')

with codecs.open("dlrjtdmstnrwp.txt", "w", "utf-8") as f:
    i = 1
    for i, tweet in enumerate(tweets):
        data = tweet.text
        data = data.encode('utf-8')
        print(i, ":", data)
        msg = (str(data) + '\n')
        f.write(msg)
        i += 1

end = time.time()
print(end - start)
browser.quit()
This is an answer to a simplified version of your problem.
Since I don't know Korean, I used Google Translate. I typed 'hello', translated it into Korean, and then looked at the translation result with 'inspect element'. The translated text sits in a span element.
Extracting that span element with Selenium is equivalent to extracting the tweet-text element in your case:
span = browser.find_element_by_class_name('short_text')
print(span.text)
This gives us the result:
>>>안녕하세요
As you can see, no encoding or decoding was needed, because in Python 3.x str is already Unicode.
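To connect this to the code in the question, here is a minimal sketch of why the \x.. sequences show up and how they can be avoided. It runs without Selenium: the two sample strings stand in for tweet.text values, and the file name tweets.txt is hypothetical. The assumption is that tweet.text already returns a str, as span.text does above.

```python
import os
import tempfile

# Hypothetical stand-ins for the values returned by tweet.text.
texts = ["배고파", "안녕하세요"]

# What the question's loop does: encode the str to bytes, then call str()
# on the bytes object. str() of bytes returns its repr, so the escaped
# form b'\xeb\xb0\xb0...' is what ends up in the file.
encoded = texts[0].encode("utf-8")
written_by_question_code = str(encoded)
print(written_by_question_code)

# Writing the str directly: the file object opened with encoding="utf-8"
# performs the encoding itself, so no manual encode() call is needed.
path = os.path.join(tempfile.mkdtemp(), "tweets.txt")
with open(path, "w", encoding="utf-8") as f:
    for text in texts:
        f.write(text + "\n")

# Reading the file back with the same encoding returns the original text.
with open(path, encoding="utf-8") as f:
    round_trip = f.read().splitlines()
```

In the question's loop, this would amount to dropping the data.encode('utf-8') line and writing tweet.text to the file as it is.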