python ascii file-extension file-encodings

Loop Through File Extensions, Looking for Non-ASCII Characters - Python


I wrote a little Python program that looks through a directory (and its subdirectories) for files that contain non-ASCII characters.

I want to improve it. I know that certain files in this "directory" may be ZIP, DTA/OUT, OMX, SFD/SF3, etc. files that ARE SUPPOSED to have non-ASCII characters. So I want to know these are there and screen for the ones that shouldn't contain non-ASCII characters, because my ultimate goal is to find files that should not contain non-ASCII characters but do, and remove them (the disk is corrupt, with bad sectors, and holds terabytes of important data).

My thinking is to further look through the files that are in the "except" portion of a try/except block in Python that looks like this:

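try:
    # Python 2: if content is a byte string, encode() first decodes it as
    # ASCII, so any non-ASCII byte raises UnicodeDecodeError.
    content.encode('ascii')
    output.write(str(counter) + ", " + file + ", ASCII\n")
    print str(counter) + " ASCII file status logged successfully: " + file
    counter += 1

except UnicodeDecodeError:
    # Files landing here are the ones to screen further by extension.
    output.write(str(counter) + ", " + file + ", non-ASCII\n")
    print str(counter) + " non-ASCII file status logged successfully: " + file
    counter += 1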

When I started to write the code, I realized that looping through and asking whether the file is '.zip' or '.sfd' or '.omx', etc. would make for a clunky program and take forever.

Is there any way to search a group of file extensions other than one by one? Maybe a file containing these extensions to check against? Or something I haven't thought of? My apologies in advance if this is a stupid question, but there are so many cool functions in Python that I'm sure I'm missing something that can help.

Cheers.


Solution

  • I figure since there aren't any answers I can go ahead and answer this myself with a partial answer. I basically took a different approach: I looked for one particular file type that is expected to be abundant on this share, and I will then do the same for each other file type. It's kind of hacky, but it will get the job done.
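
For the original question (checking a whole group of extensions at once), one possible sketch is to keep the extensions in a set and test membership after splitting the extension off the path. This is not the approach taken in the answer above. It is written for Python 3, and the names `BINARY_EXTENSIONS`, `is_expected_binary`, `is_ascii`, and `classify` are made up for illustration; the extension list is only a guess based on the file types named in the question.

```python
import os

# Hypothetical list: extensions that are expected to hold non-ASCII data.
# Adjust it to whatever actually lives on the share.
BINARY_EXTENSIONS = {".zip", ".dta", ".out", ".omx", ".sfd", ".sf3"}


def is_expected_binary(path):
    """Return True if the file's extension is in BINARY_EXTENSIONS."""
    # splitext returns (root, ext); lower() makes the check case-insensitive.
    ext = os.path.splitext(path)[1].lower()
    return ext in BINARY_EXTENSIONS


def is_ascii(data):
    """Return True if every byte in the bytes object is below 0x80."""
    return all(byte < 0x80 for byte in data)


def classify(path, data):
    """Label a file from its path and its raw bytes."""
    if is_ascii(data):
        return "ASCII"
    if is_expected_binary(path):
        return "non-ASCII (expected)"
    return "non-ASCII (suspect)"
```

A set lookup is a single hash check no matter how many extensions are listed, so there is no need to compare against them one by one. The set could also be filled from a text file with one extension per line. The `data` argument is assumed to be the file contents read in binary mode, for example with `open(path, "rb").read()`.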