Search code examples
pythonsocketsmultiprocessingpickleirc

Use multiprocessing to get information from multiple sockets


I'm working on developing a little irc client in python (ver 2.7). I had hoped to use multiprocessing to read from all servers I'm currently connected to, but I'm running into an issue

import socket
import multiprocessing as mp
import types
import copy_reg
import pickle


def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class a(object):

    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {}
        self.servers["example.com"] = sock1

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = mp.Pool()
        ## pickle.dumps(self.method)
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()

Whenever it hits the line pool.map(self.method, self.servers.keys()) I get the error

TypeError: expected string or Unicode object, NoneType found

From what I've read this is what happens when I try to pickle something that isn't picklable. To resolve this I first made the _pickle_method and _unpickle_method as described here. Then I realized that I was (originally) trying to pass pool.map() a list of sockets (very not picklable) so I changed it to the list of hostnames, as strings can be pickled. I still get this error, however.

I then tried calling pickle.dumps() directly on self.method, self.servers.keys(), and self.servers.keys()[0]. As expected it worked fine for the latter two, but from the first I get

TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled.

Some more research lead me to this question, which seems to indicate that the issue is with the use of sockets (and gnibbler's answer to that question would seem to confirm it).

Is there a way that I can actually use multiprocessing for this? From what I've (very briefly) read pathos.multiprocessing might be what I need, but I'd really like to stick to the standard library if at all possible.

I'm also not set on using multiprocessing - if multithreading would work better and avoid this issue then I'm more than open to those solutions.


Solution

  • Your root problem is that you can't pass sockets to child processes. The easy solution is to use threads instead.

    In more detail:


    Pickling a bound method requires pickling three things: the function name, the object, and the class. (I think multiprocessing does this for you automatically, but you're doing it manually; that's fine.) To pickle the object, you have to pickle its members, which in your case includes a dict whose values are sockets.

    You can't pickle sockets with the default pickling protocol in Python 2.x. The answer to the question you linked explains why, and provides the simple workaround: don't use the default pickling protocol. But there's an additional problem with socket; it's just a wrapper around a type defined in a C extension module, which has its own problems with pickling. You might be able to work around that as well…

    But that still isn't going to help. Under the covers, that C extension class is itself just a wrapper around a file descriptor. A file descriptor is just a number. Your operating system keeps a mapping of file descriptors to open sockets (and files and pipes and so on) for each process; file #4 in one process isn't file #4 in another process. So, you need to actually migrate the socket's file descriptor to the child at the OS level. This is not a simple thing to do, and it's different on every platform. And, of course, on top of migrating the file descriptor, you'll also have to pass enough information to re-construct the socket object. All of this is doable; there might even be a library that wraps it up for you. But it's not easy.


    One alternate possibility is to open all of the sockets before launching any of the children, and set them to be inherited by the children. But, even if you could redesign your code to do things that way, this only works on POSIX systems, not on Windows.


    A much simpler possibility is to just use threads instead of processes. If you're doing CPU-bound work, threads have problems in Python (well, CPython, the implementation you're almost certainly using) because there's a global interpreter lock that prevents two threads from interpreting code at the same time. But when your threads spend all their time waiting on socket.recv and similar I/O calls, there is no problem using threads. And they avoid all the overhead and complexity of pickling data and migrating sockets and so forth.

    You may notice that the threading module doesn't have a nice Pool class like multiprocessing does. Surprisingly, however, there is a thread pool class in the stdlib—it's just in multiprocessing. You can access it as multiprocessing.dummy.Pool.

    If you're willing to go beyond the stdlib, the concurrent.futures module from Python 3 has a backport named futures that you can install off PyPI. It includes a ThreadPoolExecutor which is a slightly higher-level abstraction around a pool which may be simpler to use. But Pool should also work fine for you here, and you've already written the code.