Tags: python, python-3.x, twitter, unicode, encoding

Where is my encoding wrong? (Characters like "\x95\x84\xeb..." appear.)



I removed the sys.setdefaultencoding('utf-8') statement because I use Python 3.

However, the output then shows escape sequences such as \x95\x84\xeb\xb0\xb0\xea\xb3\xa0\xed\x8c\x8c instead of Korean text.

Where is my encoding wrong?

I also find the following output hard to understand:

346 seconds: 52.25020146369934
347 seconds: 52.694828271865845
348 seconds: 52.80767774581909
349 seconds: 52.92116045951843

Only after lines like these does the data (tweets) come out. What do they mean?

#py3.6
import time
from selenium import webdriver
import codecs
import sys
import importlib

importlib.reload (sys)

browser = webdriver.PhantomJS('C:\phantomjs-2.1.1-windows/bin/phantomjs')
url = u'https://twitter.com/search?f=tweets&vertical=default&q=%EB%B0%B0%EA%B3%A0%ED%8C%8C%20since%3A2017-07-19%20until%3A2017-07-20&l=ko&src=typd&lang=ko'

browser.get(url)
time.sleep(1)

body = browser.find_element_by_tag_name('body')
browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")

start = time.time()
for _ in range(5000):
    now = time.time()
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    print (str(_) + "    seconds: " + str(now - start))
    time.sleep(0.1)

tweets=browser.find_elements_by_class_name('tweet-text')

with codecs.open("dlrjtdmstnrwp.txt", "w","utf-8") as f:
    i = 1
    for i, tweet in enumerate(tweets):
        data = tweet.text
        data = data.encode('utf-8')
        print (i, ":", data)
    msg = (str(data) +'\n')
    f.write(msg)
    i += 1

end = time.time() 
print(end - start)
browser.quit()

Solution

  • This is an answer to a simplified version of your problem.

    Since I don't know Korean, I used Google Translate: I typed 'hello', translated it into Korean, and inspected the element holding the translation result. This is what I got:

    [Image: "hello" translated into Korean]

    Extracting the span element with Selenium is equivalent to extracting the tweet-text element in your case:

    span = browser.find_element_by_class_name('short_text')
    print(span.text)
    

    This gives us the result:

    >>>안녕하세요
    

    As you can see, no encoding or decoding was needed, because in Python 3.x str is already Unicode text.
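To connect this back to the question's script, here is a minimal sketch of where the escape sequences most likely come from and one way to avoid them. It does not use Selenium: the `tweets` list is a hypothetical stand-in for the `tweet.text` values, using the search term from the question's URL, and the output file name is made up for illustration. The sketch shows that calling `.encode('utf-8')` on a `str` produces `bytes`, that `str()` on those bytes yields the escaped representation, and that writing the `str` directly to a file opened with `encoding="utf-8"` keeps the text readable.

```python
import os
import tempfile

# Hypothetical stand-ins for the tweet.text values; in Python 3 these are str.
tweets = ["배고파", "배고파 배고파"]

# Encoding turns the str into bytes, and str() on bytes gives the escaped
# representation. This is the kind of output the question describes.
encoded = tweets[0].encode("utf-8")
as_text = str(encoded)
print(as_text)  # b'\xeb\xb0\xb0\xea\xb3\xa0\xed\x8c\x8c'

# Writing the str directly, with the encoding declared on the file,
# avoids the problem. The write happens inside the loop so that every
# tweet is saved, not only the last one.
path = os.path.join(tempfile.mkdtemp(), "tweets.txt")  # illustrative path
with open(path, "w", encoding="utf-8") as f:
    for i, text in enumerate(tweets):
        f.write(f"{i}: {text}\n")

# Read the file back to confirm the Korean text survived the round trip.
with open(path, encoding="utf-8") as f:
    written = f.read()
```

Applied to the original script, this would presumably mean dropping the `data = data.encode('utf-8')` line and moving `f.write(...)` into the `for` loop; the built-in `open(..., encoding="utf-8")` can stand in for `codecs.open` in Python 3.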