For some files, Python's chardet library returns None: chardet.detect(f.read())['encoding'] comes back empty.
import codecs
import chardet

path = r"C:\A chinese novel.TXT"
with codecs.open(path, 'rb') as f:
    encoding = chardet.detect(f.read())
    print(encoding)
# Returns: {'encoding': None, 'confidence': 0.0, 'language': None}
To check the encoding from the command line, I ran os.popen("file -bi \"%s\" | gawk -F'[ =]' '{print $3}'" % path).read(), and the file tool reports the charset as unknown-8bit. Running file xxx.txt directly outputs: xxx.txt: Non-ISO extended-ASCII text, with very long lines (560), with CRLF line terminator
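For reference, here is the same check without the shell pipeline, as a minimal sketch (assuming the file tool is on PATH, e.g. via Git for Windows, Cygwin, or WSL):

import subprocess

# `file -bi` prints something like "text/plain; charset=unknown-8bit"
out = subprocess.run(['file', '-bi', path], capture_output=True, text=True).stdout.strip()
charset = out.split('charset=')[-1]
print(charset)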
Here's a GIF link showing the situation: https://i.imgur.com/5kvmnRL.gif
However, Notepad++ opens the file normally, reports the encoding as GB2312, and the characters display basically correctly.
Could the file have become corrupted, leaving a mixed-encoding file that the chardet library cannot recognize?
ChatGPT suggested using iconv to re-encode the bad file, but the text editor (Notepad++) cannot confirm the encoding before the file is opened. Is there a more reliable way to identify file encodings with Python on Windows 10?
There are several libraries you can try:

- chardet: a very popular Python package for detecting encodings.
- cchardet: a similar module implemented in C++, with the same API as chardet but much faster.
- filemagic: a Python wrapper around the libmagic library (pip install filemagic, imported as magic), which recognizes file types and encodings.
import chardet
import cchardet
import magic

# chardet: read the raw bytes and let chardet guess the encoding
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
encoding = result['encoding']
print(encoding)
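For very large files, chardet can also detect incrementally instead of reading everything into memory; a minimal sketch using chardet's UniversalDetector (part of chardet's documented API):

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open('your_file_path', 'rb') as f:
    for line in f:
        detector.feed(line)
        if detector.done:  # stop as soon as the detector is confident
            break
detector.close()
print(detector.result)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}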
# cchardet: same usage as chardet
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
result = cchardet.detect(rawdata)
encoding = result['encoding']
print(encoding)
# filemagic: with default flags, id_filename returns a textual
# description of the file type (like the `file` command)
with magic.Magic() as m:
    file_type = m.id_filename('your_file_path')
print(file_type)
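To get just the character set instead of the full description, libmagic's MIME-encoding flag can be requested; a sketch assuming the filemagic package, which exposes this flag as magic.MAGIC_MIME_ENCODING:

import magic

# Ask libmagic for only the charset, e.g. "utf-8" or "unknown-8bit"
with magic.Magic(flags=magic.MAGIC_MIME_ENCODING) as m:
    print(m.id_filename('your_file_path'))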
After testing, cchardet worked well: it successfully output the correct encoding for the file.
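Since Notepad++ identifies the file as GB2312, one extra sanity check (a sketch, not part of the verification above; it assumes path points at the file) is to decode the bytes as gb18030, a superset of GB2312, with errors='replace' and count the replacement characters; a mixed-encoding or corrupted file shows up as a high count:

with open(path, 'rb') as f:
    raw = f.read()

# GB18030 is a superset of GB2312; bytes it cannot decode become U+FFFD
text = raw.decode('gb18030', errors='replace')
bad = text.count('\ufffd')
print(f"{bad} undecodable sequences out of {len(raw)} bytes")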