python string unicode scrapy string-conversion

How to convert a unicode string containing escaped \u sequences into readable text with Python?


I'm a student learning Python and Scrapy (a crawler).

I want to convert a unicode string to readable text in Python, but this is not an ordinary string: it contains Unicode escape sequences as literal text. Please see the code below.

# python 2.7
...
print(type(name[0]))
print(name[0])
print(type(keyword_name_temp))
print(keyword_name_temp)
...

When I run the script above, the console shows the following.

$ <type 'unicode'>
$ 서용교 ## these are Korean characters
$ <type 'unicode'>
$ u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'

I want to see keyword_name_temp displayed as Korean text, but I don't know how to do it.

I got the name list and keyword_name_temp from HTML retrieved with an HTTP request.

The values in the name list were originally in string format.

keyword_name_temp was originally in this escaped unicode format.

Can anybody help me, please?


Solution

  • u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4' contains real backslash characters, each followed by u and four hex digits. It does not contain the Unicode characters U+C9C0 etc., which is what a \u escape sequence in a literal would normally produce. (Backslash is the escape character in Python string literals, so the interpreter displays a real backslash inside a string as \\.) Does that string perhaps come from a JSON object?

    You can wrap it in double quotes to form a JSON string, then use json.loads() to decode it into a proper unicode string:

    Example in Python 2.7:

    >>> s1 = u'서용교'
    >>> type(s1)
    <type 'unicode'>
    >>> s1
    u'\uc11c\uc6a9\uad50'
    >>> print(s1)
    서용교
    >>> 
    >>> 
    >>> s2 = u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'
    >>> type(s2)
    <type 'unicode'>
    >>>
    >>> # put that unicode string between double-quotes
    >>> # so that json module can interpret it
    >>> ts2 = u'"%s"' % s2
    >>> ts2
    u'"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
    >>>
    >>> import json
    >>> json.loads(ts2)
    u'\uc9c0\ubc29\uc790\uce58\ub2e8\uccb4'
    >>> print(json.loads(ts2))
    지방자치단체
    >>> 
    

    Another option is to build a Python string literal from it and evaluate that with ast.literal_eval():

    >>> import ast
    >>>
    >>> # construct a string literal, with the 'u' prefix
    >>> s2_literal = u'u"%s"' % s2
    >>> s2_literal
    u'u"\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4"'
    >>> print(ast.literal_eval(s2_literal))
    지방자치단체
    >>> 
    >>> # also works with single-quoted string literals
    >>> s2_literal2 = u"u'%s'" % s2
    >>> s2_literal2
    u"u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'"
    >>> 
    >>> print(ast.literal_eval(s2_literal2))
    지방자치단체
    >>>
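
A third possibility, not part of the sessions above, is Python's built-in unicode_escape codec, which avoids building a quoted string first. The sketch below is only an illustration: the helper name unescape_unicode is hypothetical, and it assumes the input contains nothing but ASCII characters (as s2 does), since encode('ascii') raises UnicodeEncodeError otherwise. It should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def unescape_unicode(text):
    """Decode literal backslash-u sequences into the characters they name.

    Hypothetical helper for illustration. Assumes `text` is pure ASCII;
    encode('ascii') raises UnicodeEncodeError if it is not.
    """
    # Step 1: turn the text into bytes (every character is ASCII here).
    # Step 2: let the unicode_escape codec interpret the backslash sequences.
    return text.encode('ascii').decode('unicode_escape')


# Same value as s2 in the answer: real backslashes followed by u and hex digits.
s2 = u'\\uc9c0\\ubc29\\uc790\\uce58\\ub2e8\\uccb4'
decoded = unescape_unicode(s2)

# Compare against the properly escaped literal from the json.loads() output.
print(decoded == u'\uc9c0\ubc29\uc790\uce58\ub2e8\uccb4')  # True
print(len(decoded))  # 6
```

On a UTF-8 terminal, print(decoded) should display the same Korean text as the json.loads() and ast.literal_eval() examples. Unlike those two approaches, this one does not depend on the string being free of quote characters, but it does interpret every backslash escape in the input, so it is best reserved for data known to be in this escaped form.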