Tags: python-3.x, file-io, text-files, python-unicode, corpus

UnicodeDecodeError when concatenating text files in Python


I am a Python beginner. I am trying to concatenate the text from 8 text files into a single text file to build a corpus. However, I get the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7311: character maps to <undefined>

import glob2

filenames = glob2.glob('Final_Corpus_SOAs/*.txt')  # list of all .txt files in the directory
print(filenames)

output: ['Final_Corpus_SOAs\\1.txt', 'Final_Corpus_SOAs\\2.txt', 'Final_Corpus_SOAs\\2018 SOA Muir.txt', 'Final_Corpus_SOAs\\3.txt', 'Final_Corpus_SOAs\\4.txt', 'Final_Corpus_SOAs\\5.txt', 'Final_Corpus_SOAs\\6.txt', 'Final_Corpus_SOAs\\7.txt', 'Final_Corpus_SOAs\\8.txt']

with open('output.txt', 'w', encoding="utf-8") as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

Output: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7311: character maps to <undefined>

Thanks for the help.


Solution

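  • The input files are opened without an encoding, so Python falls back to the platform default (on Windows typically cp1252, which is the 'charmap' codec named in the error), and byte 0x9d has no character assigned in that code page. If you know the encoding of the files, declare it when you open them, both for reading and for writing: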

    encoding = 'utf8'    # or 'latin1' or 'cp1252' or...

    with open('output.txt', 'w', encoding=encoding) as outfile:
        for fname in filenames:
            with open(fname, encoding=encoding) as infile:
                for line in infile:
                    outfile.write(line)
    

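  • If you are unsure of the encoding, or do not want to deal with it at all, you can copy the files at the byte level by reading and writing them in binary mode. The bytes are copied unchanged, so the output is only consistently encoded if all input files share one encoding: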

    with open('output.txt', 'wb') as outfile:
        for fname in filenames:
            with open(fname, 'rb') as infile:
                for line in infile:
                    outfile.write(line)
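
If the files might not all share one encoding, a middle ground between the two options above is to try a few candidate encodings per file and always write UTF-8. The following is only a sketch under assumptions: the helper names `read_text_with_fallback` and `concatenate` are made up for illustration, and the candidate list `("utf-8", "cp1252")` is a guess that you should adapt to your own data.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return the decoded text of `path`, trying each encoding in order.

    Hypothetical helper: the candidate encodings are an assumption.
    """
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    # Last resort: decode with the first candidate and mark every
    # undecodable byte with U+FFFD instead of raising.
    return data.decode(encodings[0], errors="replace")


def concatenate(filenames, output_path, encodings=("utf-8", "cp1252")):
    """Write the decoded contents of all files to one UTF-8 file."""
    # newline="" writes line endings exactly as they were read.
    with open(output_path, "w", encoding="utf-8", newline="") as outfile:
        for fname in filenames:
            outfile.write(read_text_with_fallback(fname, encodings))
```

With the `filenames` list from the question, this would be called as `concatenate(filenames, 'output.txt')`. UTF-8 is tried first because it is strict and rarely accepts text that was written in another encoding, whereas a single-byte code page such as cp1252 accepts almost any byte sequence and can silently produce wrong characters.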