Python IOError cannot allocate memory although there is plenty


I've written a basic program that walks a directory tree containing many JPEG files (500,000+), verifies that they are not corrupted (approximately 3-5% of the files seem to be corrupt in some way), takes a SHA-1 hash of each file (even the corrupt ones), and saves the info into a database.

The JPEG files in question are located on a Windows system and mounted on the Linux box via CIFS. They are mostly around 4 megabytes in size, although some may be slightly larger or smaller.

When I run the program it works fairly well for a while and then falls over with the error below. This was after it had processed approximately 1100 files (the error indicated that the problem occurred when attempting to open a 4.5 MB file).

Now, I understand that I can catch this error and continue or retry, but I'm curious as to why it is occurring in the first place, and whether catching and retrying will actually solve the problem or just get stuck retrying (unless I limit the retries, of course, but then a file gets skipped).

I'm using Python 2.7.5+ on a Debian system to run this. The system has at least 4 GB (possibly 8 GB) of RAM, and top reports that the script uses less than 1% of the RAM and less than 3% of the CPU at any time while it is running. jpeginfo, which this script runs, uses similarly small amounts of memory and CPU.

To avoid using too much memory when reading files, I have taken the approach given in this answer to another question: https://stackoverflow.com/a/1131255/289545

Also, you may note that the jpeginfo command is in a while loop looking for an "[OK]" response. This is because if jpeginfo thinks it can't find the file, it returns 0, so the subprocess.check_output call does not treat it as an error.

I did wonder whether the fact that jpeginfo fails to find certain files on the first try could be related (and I suspect it is), but the error returned says "cannot allocate memory" rather than "file not found".

The Error:

Traceback (most recent call last):
  File "/home/m3z/jpeg_tester", line 95, in <module>
    main()
  File "/home/m3z/jpeg_tester", line 32, in __init__
    self.recurse(self.args.dir, self.scan)
  File "/home/m3z/jpeg_tester", line 87, in recurse
    cmd(os.path.join(root, name))
  File "/home/m3z/jpeg_tester", line 69, in scan
    with open(filepath) as f:
IOError: [Errno 12] Cannot allocate memory: '/path/to/file name.jpg'

The full program code:

  1 #!/usr/bin/env python
  2
  3 import os
  4 import time
  5 import subprocess
  6 import argparse
  7 import hashlib
  8 import oursql as sql
  9
 10
 11
 12 class main:
 13     def __init__(self):
 14         parser = argparse.ArgumentParser(description='Check jpeg files in a given directory for errors')
 15         parser.add_argument('dir',action='store', help="absolute path to the directory to check")
 16         parser.add_argument('-r', '--recurse', dest="recurse", action='store_true', help="should we check subdirectories")
 17         parser.add_argument('-s', '--scan', dest="scan", action='store_true', help="initiate scan?")
 18         parser.add_argument('-i', '--index', dest="index", action='store_true', help="should we index the files?")
 19
 20         self.args = parser.parse_args()
 21         self.results = []
 22
 23         if not self.args.dir.startswith("/"):
 24                 print "dir must be absolute"
 25                 quit()
 26
 27         if self.args.index:
 28                 self.db = sql.connect(host="localhost",user="...",passwd="...",db="fileindex")
 29                 self.cursor = self.db.cursor()
 30
 31         if self.args.recurse:
 32                 self.recurse(self.args.dir, self.scan)
 33         else:
 34                 self.scan(self.args.dir)
 35
 36         if self.args.index:
 37                 self.db.close()
 38
 39         for line in self.results:
 40                 print line
 41
 42
 43
 44     def scan(self, dirpath):
 45         print "Scanning %s" % (dirpath)
 46         filelist = os.listdir(dirpath)
 47         filelist.sort()
 48         total = len(filelist)
 49         index = 0
 50         for filen in filelist:
 51                 if filen.lower().endswith(".jpg") or filen.lower().endswith(".jpeg"):
 52                         filepath = os.path.join(dirpath, filen)
 53                         index = index+1
 54                         if self.args.scan:
 55                                 try:
 56                                         procresult = subprocess.check_output(['jpeginfo','-c',filepath]).strip()
 57                                         while "[OK]" not in procresult:
 58                                                 time.sleep(0.5)
 59                                                 print "\tRetrying %s" % (filepath)
 60                                                 procresult = subprocess.check_output(['jpeginfo','-c',filepath]).strip()
 61                                         print "%s/%s: %s" % ('{:>5}'.format(str(index)),total,procresult)
 62                                 except subprocess.CalledProcessError, e:
 63                                         os.renames(filepath, os.path.join(dirpath, "dodgy",filen))
 64                                         filepath = os.path.join(dirpath, "dodgy", filen)
 65                                         self.results.append("Trouble with: %s" % (filepath))
 66                                         print "%s/%s: %s" % ('{:>5}'.format(str(index)),total,e.output.strip())
 67                         if self.args.index:
 68                                 sha1 = hashlib.sha1()
 69                                 with open(filepath) as f:
 70                                         while True:
 71                                                 data = f.read(8192)
 72                                                 if not data:
 73                                                         break
 74                                                 sha1.update(data)
 75                                 sqlcmd = ("INSERT INTO `index` (`sha1`,`path`,`filename`) VALUES (?, ?, ?);", (buffer(sha1.digest()), dirpath, filen))
 76                                 self.cursor.execute(*sqlcmd)
 77
 78
 79     def recurse(self, dirpath, cmd, on_files=False):
 80         for root, dirs, files in os.walk(dirpath):
 81             if on_files:
 82                 for name in files:
 83                     cmd(os.path.join(root, name))
 84             else:
 85                 cmd(root)
 86                 for name in dirs:
 87                     cmd(os.path.join(root, name))
 88
 89
 90
 91
 92
 93
 94 if __name__ == "__main__":
 95     main()

Solution

  • It looks to me like Python is just passing on an error from the underlying open() call, and the real culprit here is the Linux CIFS support. I doubt Python would be synthesizing ENOMEM unless system memory was truly exhausted (and even then I'd probably expect the Linux OOM killer to be invoked rather than getting ENOMEM).

    Unfortunately it might take a Linux filesystem expert to figure out what's going on there. Looking at the sources for CIFS in the Linux kernel, I can see a variety of places where ENOMEM is returned when various kernel-specific resources are exhausted, as opposed to total system memory, but I'm not familiar enough with the code to say how likely any of them are.

    To rule out anything Python-specific you can run the process under strace so you can see the exact return code that Python is getting from Linux. To do this, run your command something like this:

    strace -eopen -f python myscript.py myarg1 myarg2 2>strace.log
    

    The -f will follow child processes (i.e. the jpeginfo commands that you run) and the -eopen will show only open() calls rather than all system calls (which is what strace shows by default). On newer systems, where the C library implements open() via the openat system call, you may need -e trace=open,openat instead. This could generate a fair amount of output, which is why I've redirected it to a file in the above example, but you can leave it displaying on your terminal if you prefer.

    I would expect you'd see something like this just before you get your exception:

    open("/path/to/file name.jpg", O_RDONLY) = -1 ENOMEM (Cannot allocate memory)
    

    If so, this error is coming straight from the filesystem open() call and there's very little you can do about it in your Python script. You could catch the exception and retry (perhaps after a short delay) as you're already doing if jpeginfo fails, but it's hard to say how successful this strategy will be without knowing what's causing the errors in the first place.
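The catch-and-retry idea above can be bounded so that it never loops forever. The following is a minimal sketch, not a tested fix for the CIFS problem: the function name sha1_with_retry and the max_tries, delay and opener parameters are hypothetical, and it assumes the failure is transient and surfaces as errno ENOMEM, as in the traceback in the question. It is written so that it should run on both Python 2.7 and Python 3.

```python
import errno
import hashlib
import time


def sha1_with_retry(filepath, max_tries=5, delay=0.5, opener=open):
    """Return the SHA-1 hex digest of filepath, retrying on ENOMEM.

    max_tries, delay and opener are hypothetical knobs for this sketch;
    opener exists only so the retry logic can be exercised without CIFS.
    """
    for attempt in range(1, max_tries + 1):
        try:
            # Start a fresh hash on every attempt so a read that fails
            # part-way through does not leave a half-updated digest.
            sha1 = hashlib.sha1()
            with opener(filepath, "rb") as f:
                # Read in small chunks so memory use stays flat.
                for chunk in iter(lambda: f.read(8192), b""):
                    sha1.update(chunk)
            return sha1.hexdigest()
        except (IOError, OSError) as e:
            # Only ENOMEM is treated as possibly transient; any other
            # error, or ENOMEM on the final attempt, propagates.
            if e.errno != errno.ENOMEM or attempt == max_tries:
                raise
            time.sleep(delay)
```

If every attempt fails, the last exception propagates, so the caller can record the path and move on rather than silently skipping the file.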

    You could, of course, copy the files locally, but it sounds like that would be a serious pain as there are so many.

    EDIT: As an aside, expect to see lots of open() calls that have nothing to do with your script, because strace traces every call made by Python, including those that open its own .py and .pyc files. Just ignore the ones that don't refer to the files you're interested in.
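
Since most of the trace is noise, a small filter can pull out just the failed opens of interest. This is a rough sketch that assumes the log lines look like the example line shown earlier; the name failed_opens and the regular expression are illustrative only, and strace's exact output format can vary between versions.

```python
import re

# Matches failed open()/openat() lines such as:
#   open("/path/to/file name.jpg", O_RDONLY) = -1 ENOMEM (Cannot allocate memory)
# search() is used below so a leading "[pid NNN]" prefix from -f is tolerated.
FAILED_OPEN = re.compile(
    r'\bopen(?:at)?\(.*?"(?P<path>[^"]*)".*\)\s+=\s+-1\s+(?P<err>E[A-Z0-9]+)'
)


def failed_opens(lines, suffixes=(".jpg", ".jpeg")):
    """Yield (path, errno_name) for failed opens of files with the given suffixes."""
    for line in lines:
        m = FAILED_OPEN.search(line)
        if m and m.group("path").lower().endswith(suffixes):
            yield m.group("path"), m.group("err")
```

Passing the open strace.log file object to failed_opens() would then list only the JPEG paths whose open failed, together with the error name the kernel returned.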