python uuencode uudecode

Trying to determine if file has been uuencoded


I am trying to process a large collection of txt files that are themselves containers for the actual files I want to process. The txt files have SGML tags that mark the boundaries of the individual contained files. Sometimes the contained files are binaries that have been uuencoded. I have solved the problem of decoding the uuencoded files, but as I was mulling over my solution I realized that it is not general enough. That is, I have been using

if '\nbegin 644 ' in document['document']:

to test whether the file is uuencoded. I did some searching and have a vague understanding of what the 644 means (file permissions), and I have since found other examples of uuencoded files that might have

if '\nbegin 642 ' in document['document']:

or some other variant. Thus, my problem is: how do I make sure that I identify all of the subcontainers that contain uuencoded files?

One solution is to test every subcontainer:

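import codecs

uudecode = codecs.getdecoder("uu")

for document in documents:
    try:
        # Note: under Python 3 the uu codec takes bytes, not str
        decoded_document, m = uudecode(document)
    except ValueError:
        # raised when the data is not valid uuencoded input
        decoded_document = ''
    if len(decoded_document) == 0:
        pass  # more stuff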

This is not horrible, since CPU cycles are cheap, but I am going to be handling some 8 million documents.

Thus, is there a more robust way to recognize whether or not a particular string is the result of uuencoding?


Solution

  • Wikipedia says that every uuencoded file begins with this line:

    begin <perm> <name>
    

    So a line matching the regexp ^begin [0-7]{3} (.*)$ probably denotes the beginning reliably enough.
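
    A minimal sketch of that check in Python, assuming the documents are available as text strings. The names BEGIN_LINE, looks_uuencoded and uuencoded_name are made up for illustration, and the pattern keeps the three-octal-digit assumption from the regexp above; loosen it if your data uses other permission formats.

```python
import re

# Header line of a uuencoded block: "begin <perm> <name>".
# Assumption: <perm> is exactly three octal digits, as in the regexp above.
# re.MULTILINE lets ^ and $ match at every line, not just at the string ends.
BEGIN_LINE = re.compile(r"^begin [0-7]{3} (.*)$", re.MULTILINE)


def looks_uuencoded(text):
    """Return True if any line of text looks like a uuencode header."""
    return BEGIN_LINE.search(text) is not None


def uuencoded_name(text):
    """Return the file name from the first header line, or None if absent."""
    match = BEGIN_LINE.search(text)
    # strip() drops a trailing "\r" left over from CRLF line endings
    return match.group(1).strip() if match else None
```

    With something like this, the expensive decode attempt could be limited to the documents for which looks_uuencoded returns True, rather than being tried on all of them. A matching header is only a hint, so keeping the try/except around the actual decode still seems wise.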