
Make json.dumps output unicode characters properly in Python


I'm new to Python and I'm trying to encode a UTF-8 string as JSON. Using PHP's json_encode(), … (U+2026) becomes \u2026. However, using Python's json.dumps(), it becomes \u00e2\u20ac\u00a6. How do I get \u2026 in Python?

Here's the entire program:

import nltk
import json

file=open('pos_tag.txt','r')
tags=nltk.pos_tag(nltk.word_tokenize(file.read()))

print(json.dumps(tags,separators=(',',':')))

Solution

  • The problem lay in the open() call, which did not specify an encoding, so the UTF-8 file was decoded incorrectly. I was able to fix it using the codecs module:

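    import codecs
    import json

    import nltk

    # Open the file with an explicit encoding so its bytes are decoded as UTF-8.
    with codecs.open('pos_tag.txt', 'r', 'utf-8') as f:
        tags = nltk.pos_tag(nltk.word_tokenize(f.read()))

    # Compact separators drop the whitespace after ',' and ':'.
    print(json.dumps(tags, separators=(',', ':')))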
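To see why the encoding matters, here is a minimal sketch that reproduces the symptom without nltk. It assumes Python 3 and assumes the wrong codec was cp1252 (a common default on Windows, and the one that yields exactly the \u00e2\u20ac\u00a6 sequence from the question). The temporary file and the variable names are only for illustration. It also uses the built-in open() with an encoding argument, which in Python 3 can stand in for codecs.open().

```python
import json
import os
import tempfile

ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS, the character from the question

# Write the character to a throwaway file as UTF-8 (bytes E2 80 A6).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(ELLIPSIS)
    path = tmp.name

try:
    # Wrong: decode the UTF-8 bytes as cp1252 (assumed codec, see above).
    with open(path, "r", encoding="cp1252") as f:
        wrong = f.read()

    # Right: state the file's real encoding explicitly.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(path)

wrong_json = json.dumps(wrong)   # three mis-decoded characters, escaped
right_json = json.dumps(right)   # one character, escaped

# ensure_ascii=False keeps the character itself instead of an escape.
literal_json = json.dumps(right, ensure_ascii=False)

print(wrong_json)  # "\u00e2\u20ac\u00a6"
print(right_json)  # "\u2026"
```

Each of the three UTF-8 bytes gets decoded as its own cp1252 character, which is why one ellipsis turns into three escapes. If the actual character is wanted in the output rather than \u2026, json.dumps accepts ensure_ascii=False, as in literal_json above.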