I am trying to process a large collection of txt files which are themselves containers for the actual files I want to process. The txt files use SGML tags to mark the boundaries of the individual contained files. Sometimes the contained files are binary files that have been uuencoded. I have solved the problem of decoding the uuencoded files, but as I mulled over my solution I realized it is not general enough. That is, I have been using
if '\nbegin 644 ' in document['document']
to test whether a file is uuencoded. After some searching I have a vague understanding of what the 644 means (Unix file permissions), and I have since found other examples of uuencoded files that instead contain
if '\nbegin 642 ' in document['document']
or other variants. So my problem is: how do I make sure I capture/identify all of the subcontainers that hold uuencoded files?
One solution is to test every subcontainer:
import codecs

uudecode = codecs.getdecoder("uu")
for document in documents:
    try:
        decoded_document, length = uudecode(document)
    except ValueError:
        decoded_document = ''
    if len(decoded_document) == 0:
        # more stuff
This is not horrible, since CPU cycles are cheap, but I am going to be handling some 8 million documents.
Thus, is there a more robust way to recognize whether or not a particular string is the result of uuencoding?
Wikipedia says that every uuencoded file begins with a line of the form
begin <perm> <name>
So a line matching the regexp ^begin [0-7]{3} (.*)$ (with the multi-line flag set, so ^ anchors at each line start)
denotes the beginning reliably enough.
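As a minimal sketch of that idea (the sample string and the three-digit-mode assumption are mine; real uuencode headers could in principle use fewer digits):

```python
import re

# Match a uuencode header line anywhere in the document.
# MULTILINE makes ^ anchor at every line start, not just the string start.
# Assumes a three-digit octal mode, e.g. 644 or 755.
uu_header = re.compile(r'^begin [0-7]{3} (.+)$', re.MULTILINE)

# Hypothetical container text standing in for one of the txt subcontainers.
sample = "some sgml header\nbegin 644 photo.jpg\nM...encoded data...\n`\nend\n"

m = uu_header.search(sample)
if m:
    print("uuencoded file found:", m.group(1))  # prints the embedded filename
```

This avoids running the full uu decoder on all 8 million documents; only the ones whose header matches need to be decoded.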