
Pickling with Unicode in Python 3


I am trying to pickle a dictionary of the form {word : {docId : int}}. My code is below:

def vocabProcess(documents):
    word_splitter = re.compile(r"\w+", re.VERBOSE)
    stemmer=PorterStemmer()#
    stop_words = set(stopwords.words('english'))

    wordDict = {}
    for docId in documents:
        processedDoc = [stemmer.stem(w.lower()) for w in 
        word_splitter.findall(reuters.raw(docId)) if not w in stop_words]

        for w in processedDoc:
            if w not in wordDict:
                wordDict[w] = {docId : processedDoc.count(w)}
            else:
                wordDict[w][docId] = processedDoc.count(w)
    with open("vocabListings.txt", "wb") as f:
        _pickle.dump(wordDict, f)

if __name__ == "__main__":
    documents = reuters.fileids()
    with open("vocabListings.txt", "r") as f:
        vocabulary = _pickle.load(f)    

When I run this code, I get the error

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2399: 
character maps to <undefined>

Why is this breaking when none of the reuters docs/docids have unicode in them? How do I fix this so that I can still use the _pickle module?


Solution

  • Pickles must be both written and read in binary mode. You write in binary mode, but you read in text mode:

    with open("vocabListings.txt", "r") as f:
        vocabulary = _pickle.load(f)    
    

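    On Python 3, a file opened in text mode returns str (text), not bytes (the binary type pickle works with), and it decodes the file's contents using your locale's encoding. A raw binary stream is unlikely to be valid text in that encoding, so the read fails before pickle ever sees the data. That is why the error appears even though your documents contain no non-ASCII text: the failing byte comes from the pickle format itself, not from your strings. The 'charmap' codec named in the error is a single-byte code page, most likely the Windows default (for example cp1252), in which byte 0x81 is not assigned to any character.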

    On Python 2 on Windows, reading in text mode sometimes appears to work, but it silently corrupts the data whenever the binary stream contains a \r\n sequence, because text mode translates that sequence to \n before pickle sees it.

    Either way, use mode "rb" to read (just like you used "wb" to write), and you'll be fine.
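
A minimal sketch of the binary-mode round trip, in case it helps to see it end to end. The sample dictionary and the temporary file path are made up for illustration; they are not the asker's data. The sketch uses the public pickle module rather than _pickle, since pickle already picks up the C implementation when it is available. The last part assumes the locale encoding was something like cp1252 and shows the same kind of decode failure on a single byte.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the {word: {docId: count}} dictionary in the question.
word_dict = {"trade": {"test/14826": 3}, "oil": {"training/1": 1, "test/14828": 2}}

# Temporary location chosen for this sketch only.
path = os.path.join(tempfile.mkdtemp(), "vocabListings.pkl")

# Write in binary mode ...
with open(path, "wb") as f:
    pickle.dump(word_dict, f)

# ... and read in binary mode as well.
with open(path, "rb") as f:
    vocabulary = pickle.load(f)

assert vocabulary == word_dict

# Why text mode breaks: decoding arbitrary bytes as text can fail.
# cp1252 is an assumed example of a Windows locale encoding; it has
# no character assigned to byte 0x81.
decode_failed = False
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    decode_failed = True
```

Whether a text-mode read of a given pickle fails with UnicodeDecodeError or with a different error depends on which bytes the file happens to contain and on the encoding in use, so opening with "rb" is the only reliable option.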