Tags: python, python-2.7, decode, python-unicode

How to decode a string containing Persian/Arabic characters?


When web scraping, I sometimes need to get data from Persian webpages, but when I decode the data and look at the extracted text, the result is not what I expect.

Here is step-by-step code that reproduces the problem:

1. Getting data from a Persian website

import urllib2

data = urllib2.urlopen('http://cafebazar.ir').read()  # this is a Persian website

2. Detecting the type of encoding

import chardet
chardet.detect(data)
# in this case the result is:
# {'confidence': 0.6567038227597763, 'encoding': 'ISO-8859-2'}

3. Decoding and encoding

final = data.decode(chardet.detect(data)['encoding']).encode('ascii', 'ignore')

But the final result is not in Persian at all!


Solution

  • The fundamental problem is that character-set detection is not fully deterministic. chardet, like every program of its kind, is a heuristic detector (note the confidence of only about 0.66 in your output). There is no guarantee or expectation that it will guess correctly all the time, and your program needs to cope with that.

    If your problem is a single web site, simply inspect it and hard-code the correct character set.

    If you are dealing with a constrained set of sites, with a restricted and somewhat predictable set of languages, most heuristic detectors have tweaks and settings you can pass in to improve the accuracy by constraining the possibilities.

    In the most general case, there is no single solution which works correctly for all the sites in the world.

    Many sites lie: they give you well-defined and helpful-looking Content-Type: headers and lang tags which totally misrepresent what's actually there. Sometimes this is because of admin error; sometimes because they use a CMS which forces them to pretend their site is in a single language when in reality it isn't; and often because there is no language support in the back end, and something along the way "helpfully" adds a tag or header when it would be more correct, and actually helpful, to say you don't know when you don't know.

    What you can do is to code defensively. Maybe try chardet, then fall back to whatever the site tells you, then fall back to UTF-8, then maybe Latin-1? The jury is out while the world keeps on changing...
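
As a rough illustration of that defensive fallback chain, here is a minimal sketch, not a definitive implementation. The function name decode_defensively, its declared parameter (whatever charset the site claims), and the min_confidence cut-off of 0.9 are all assumptions made up for this example; chardet is treated as optional and skipped if it is not installed.

```python
try:
    import chardet  # optional third-party heuristic detector
except ImportError:
    chardet = None


def decode_defensively(raw, declared=None, min_confidence=0.9):
    """Return (text, encoding_used) for a byte string of unknown encoding.

    Hypothetical helper: `declared` is the charset the site claims in its
    header or meta tag, and `min_confidence` is an arbitrary illustrative
    cut-off, not a chardet default.
    """
    candidates = []
    if chardet is not None:
        guess = chardet.detect(raw)
        # Only trust the heuristic when it is reasonably sure of itself.
        if guess['encoding'] and guess['confidence'] >= min_confidence:
            candidates.append(guess['encoding'])
    if declared:
        candidates.append(declared)
    candidates.append('utf-8')

    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one

    # Latin-1 maps every byte value to a code point, so this decode cannot
    # fail, but the text is only meaningful if the page really is Latin-1.
    return raw.decode('latin-1'), 'latin-1'
```

With the data from the question, this could be called as text, used = decode_defensively(data). Whatever encoding ends up being used, it is probably worth keeping the result as unicode (or encoding it to UTF-8 for output): encoding to 'ascii' with 'ignore', as in step 3 of the question, silently drops every non-ASCII character, so no Persian text can survive that step even when the decoding was right.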