
How to deal with Japanese text using Python xlrd


This is my code:

#!/usr/bin/python   
#-*-coding:utf-8-*-   

import xlrd,sys,re

data = xlrd.open_workbook('a.xls',encoding_override="utf-8")
a = data.sheets()[0]
s=''
for i in range(a.nrows):
    if 9<i<20:
        #stage
        print a.row_values(i)[1].decode('shift_jis')+'\n'

But it shows:

????
????????
??????
????
????
????
????????

So what can I do?

Thanks.


Solution

  • Background: In a "modern" (Excel 97-2003) XLS file, text is effectively stored as Unicode. In older files, text is stored as 8-bit strings, and a "codepage" record tells how it is encoded; for example, the integer 1252 corresponds to the encoding known as cp1252 or windows-1252. In either case, xlrd presents extracted text as unicode objects.
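
    The codepage-to-encoding relationship can be sketched in plain Python. This is only an illustration: codec_for_codepage is a hypothetical helper, not part of xlrd's API, and it assumes the common convention that Windows codepage N corresponds to Python's cpN codec, with 1200 meaning UTF-16LE. The sample text is also an assumption.

```python
# -*- coding: utf-8 -*-
# Illustration only: xlrd handles this decoding for you when it reads a file.

def codec_for_codepage(codepage):
    """Hypothetical helper: map a codepage number to a Python codec name."""
    if codepage == 1200:  # codepage 1200 is UTF-16 little-endian
        return 'utf_16_le'
    return 'cp%d' % codepage  # e.g. 1252 -> 'cp1252', 932 -> 'cp932'

japanese = u'\u65e5\u672c\u8a9e'  # three sample Japanese characters

# An old-style file would hold 8-bit bytes plus a codepage record (932 here).
old_style_bytes = japanese.encode(codec_for_codepage(932))
# A new-style file effectively holds UTF-16LE text.
new_style_bytes = japanese.encode(codec_for_codepage(1200))

# Decoding with the matching codec recovers the same unicode text either way.
assert old_style_bytes.decode(codec_for_codepage(932)) == japanese
assert new_style_bytes.decode(codec_for_codepage(1200)) == japanese
```

    Either way the end result is the same unicode text, which is why the calling code should not need to decode anything itself.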

    Please insert this line into your code:

    print data.biff_version, data.codepage, data.encoding
    

    If you have a new file, you should see

    80 1200 utf_16_le
    

    In any case, please edit your question to report the outcome.

    Problem 1: encoding_override is required ONLY if the file is an old file AND you know/suspect that the codepage record is omitted or wrong. It is ignored if the file is a new file. Do you really know that the file is pre-Excel-97 and the text is encoded in UTF-8? If so, it can only have been created by some seriously deluded 3rd-party software, and Excel will blow up if you try to open the file with it; visit the author with a baseball bat. Otherwise, don't use encoding_override.

    Problem 2: You should have unicode objects. To display them, you need to encode (not decode) them from unicode to str using a suitable encoding. It is very surprising that print unicode_object.decode('shift-jis') doesn't raise an exception and prints question marks instead.
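
    To make the encode/decode distinction concrete, here is a minimal sketch that uses a hard-coded unicode value in place of what xlrd returns (the sample text and the encodings chosen are assumptions for illustration). It also shows one way question marks can arise: encoding to a codec that cannot represent the characters, with the 'replace' error handler. Whether that is what happened in your case is not established.

```python
text = u'\u65e5\u672c\u8a9e'  # stand-in for a unicode value from row_values()

# Encoding goes from unicode text to bytes in a named encoding.
sjis_bytes = text.encode('shift_jis')
utf8_bytes = text.encode('utf-8')

# Decoding goes the other way, from bytes back to unicode text.
assert sjis_bytes.decode('shift_jis') == text
assert utf8_bytes.decode('utf-8') == text

# A codec that cannot represent the characters yields '?' under 'replace'.
lossy = text.encode('ascii', 'replace')
print(repr(lossy))
```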

    To help understand this, please change your code to be like this:

    text = a.row_values(i)[1]
    print i, repr(text)
    print repr(text.decode('shift-jis'))
    

    and report the outcome.

    So that we can help you choose an appropriate encoding (if any), tell us what version of what operating system you are using, and what the following code displays:

    print sys.stdout.encoding
    import locale
    print locale.getpreferredencoding()
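
    Once those two values are known, choosing an output encoding could look roughly like the sketch below. The helper names pick_output_encoding and to_console_bytes are hypothetical, and the fallback order (stdout's encoding, then the locale's preferred encoding, then UTF-8) is an assumption rather than a recommendation from the xlrd documentation.

```python
import locale
import sys

def pick_output_encoding():
    """Hypothetical helper: guess an encoding that the console can display."""
    stdout_encoding = getattr(sys.stdout, 'encoding', None)
    return stdout_encoding or locale.getpreferredencoding() or 'utf-8'

def to_console_bytes(text, encoding=None):
    """Encode unicode text for output; unencodable characters become '?'."""
    return text.encode(encoding or pick_output_encoding(), 'replace')

sample = u'\u65e5\u672c\u8a9e'  # stand-in for a value read by xlrd
encoded = to_console_bytes(sample, 'utf-8')
assert encoded.decode('utf-8') == sample
```

    If the console encoding cannot represent Japanese text, the 'replace' handler substitutes question marks instead of raising an exception, so the result is worth inspecting with repr() before trusting what is shown on screen.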
    

    Further reading:

    (1) the xlrd documentation (section on Unicode, right up the front) ... included in the distribution, or get the latest commit here.

    (2) the Python Unicode HOWTO.