Twisted getPage, exceptions.OSError: [Errno 24] Too many open files


I'm trying to run the following script with about 3000 items. The script takes the link provided by self.book and returns the result using getPage. It loops through each item in self.book until there are no more items in the dictionary.

Here's the script:

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.web.error import Error
from twisted.internet.defer import DeferredList

import logging
from src.utilitybelt import Utility

class getPages(object):
    """ Return contents from HTTP pages """

    def __init__(self, book, logger=False):
        self.book = book
        self.data = {}
        util = Utility()
        if logger:
            log = util.enable_log("crawler")

    def start(self):
        """ get each page """
        deferreds = []
        for key in self.book.keys():
            page = self.book[key]
            logging.info(page)

            d1 = getPage(page)

            d1.addCallback(self.pageCallback, key)
            d1.addErrback(self.errorHandler, key)
            deferreds.append(d1)

        # Wait for every request to finish, then stop the reactor
        dl = DeferredList(deferreds)
        dl.addCallback(self.listCallback)

    def errorHandler(self, result, key):
        # Bad thingy!
        logging.error(result)
        self.data[key] = False
        logging.info("Appended False at %d" % len(self.data))

    def pageCallback(self, result, key):
        ########### I added this, to hold the data:
        self.data[key] = result
        logging.info("Data appended")
        return result

    def listCallback(self, result):
        #print result
        # Added for effect:
        if reactor.running:
            reactor.stop()
            logging.info("Reactor stopped")

About halfway through, I experience this error:

 File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 303, in _handleSignals

 File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 205, in __init__

 File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 138, in __init__

exceptions.OSError: [Errno 24] Too many open files
libgcc_s.so.1 must be installed for pthread_cancel to work
libgcc_s.so.1 must be installed for pthread_cancel to work

For now, I'll try running the script with fewer items to see if that resolves the issue. However, there must be a better way to do this, and I'd really like to learn it.

Thank you for your time.


Solution

  • It looks like you are hitting the per-process limit on open file descriptors (ulimit -n), which is likely to be 1024. Each getPage call opens a client TCP socket for its HTTP request, and every open socket uses one file descriptor. Because the loop starts a request for every item at once, the script asks for roughly 3000 sockets at the same time. The most robust fix is to limit the number of getPage calls that run concurrently. Another option is to raise the file descriptor limit for your process, but you might still exhaust ports or descriptors if self.book grows beyond about 32K items.