I am a Python beginner. I am trying to concatenate the text from all 8 text files into one text file to build a corpus. However, I get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7311: character maps to <undefined>
    import glob2

    filenames = glob2.glob('Final_Corpus_SOAs/*.txt')  # list of all .txt files in the directory
    print(filenames)
output: ['Final_Corpus_SOAs\\1.txt', 'Final_Corpus_SOAs\\2.txt', 'Final_Corpus_SOAs\\2018 SOA Muir.txt', 'Final_Corpus_SOAs\\3.txt', 'Final_Corpus_SOAs\\4.txt', 'Final_Corpus_SOAs\\5.txt', 'Final_Corpus_SOAs\\6.txt', 'Final_Corpus_SOAs\\7.txt', 'Final_Corpus_SOAs\\8.txt']
    with open('output.txt', 'w', encoding="utf-8") as outfile:
        for fname in filenames:
            with open(fname) as infile:
                for line in infile:
                    outfile.write(line)
Output: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7311: character maps to <undefined>
Thanks for the help.
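When no encoding is given, open() decodes with the platform default, which is what the 'charmap' error points to (on Windows this is typically cp1252, where byte 0x9d is undefined). If you are sure of the files' encoding, declare it when you open them, both for reading and writing: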
    encoding = 'utf8'  # or 'latin1' or 'cp1252' or...
    with open('output.txt', 'w', encoding=encoding) as outfile:
        for fname in filenames:
            with open(fname, encoding=encoding) as infile:
                for line in infile:
                    outfile.write(line)
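To see why declaring the encoding matters, here is a minimal sketch of one plausible cause. It assumes (this is a guess, not something the question confirms) that the files are UTF-8 and contain a curly closing quote, whose UTF-8 encoding ends in byte 0x9d. Decoding those bytes as cp1252 fails in the same way as the reported error, while decoding them as UTF-8 works:

```python
# Assumption for illustration: the offending character is U+201D
# (right double quotation mark), a common character in copied text.
data = "\u201d".encode("utf-8")  # three bytes: 0xE2 0x80 0x9D

try:
    data.decode("cp1252")  # 0x9D has no mapping in cp1252
    decoded_with_cp1252 = True
except UnicodeDecodeError as exc:
    decoded_with_cp1252 = False
    print("cp1252 failed:", exc)

# The same bytes decode cleanly when the right encoding is declared.
text = data.decode("utf-8")
print(repr(text))
```

If your files were saved with a different encoding, the offending character may be something else, so treat this only as an illustration of the mechanism.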
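If you are unsure of the encoding or do not want to deal with it, you can copy the files at the byte level by reading and writing them in binary mode. Note that the result is only consistently decodable if all input files share the same encoding: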
    with open('output.txt', 'wb') as outfile:
        for fname in filenames:
            with open(fname, 'rb') as infile:
                for line in infile:
                    outfile.write(line)
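The byte-level copy can also be written with shutil.copyfileobj from the standard library, which streams each file in chunks instead of line by line. The sketch below is one way to package it; the helper name concatenate_binary and the throwaway files are hypothetical and only there to make the example self-contained:

```python
import os
import shutil
import tempfile


def concatenate_binary(filenames, output_path):
    """Append the raw bytes of each input file to output_path, in order."""
    with open(output_path, "wb") as outfile:
        for fname in filenames:
            with open(fname, "rb") as infile:
                shutil.copyfileobj(infile, outfile)  # no decoding happens here


# Small self-check using two temporary files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, payload in enumerate([b"first \xe2\x80\x9d\n", b"second\n"]):
        path = os.path.join(tmp, f"{i}.txt")
        with open(path, "wb") as f:
            f.write(payload)
        parts.append(path)

    out_path = os.path.join(tmp, "output.txt")
    concatenate_binary(parts, out_path)

    with open(out_path, "rb") as f:
        result = f.read()
```

With the question's setup, the call would be concatenate_binary(filenames, 'output.txt'). Keep the output file outside the globbed directory so it is not picked up as an input on a later run.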