python observer-pattern producer-consumer

Is this an insane implementation of a producer-consumer type thing?


# file1.py

class _Producer(object):

  def __init__(self):
    self.chunksize = 6220800
    with open('/dev/zero', 'rb') as f:
      self.thing = f.read(self.chunksize)
    self.n = 0
    self.start()

  def start(self):
    import subprocess
    import threading

    def produce():
      self._proc = subprocess.Popen(['producer_proc'], stdout=subprocess.PIPE)
      while True:
        self.thing = self._proc.stdout.read(self.chunksize)
        if len(self.thing) != self.chunksize:
          msg = 'Expected {0} bytes.  Read {1} bytes'.format(self.chunksize, len(self.thing))
          raise Exception(msg)
        self.n += 1

    t = threading.Thread(target=produce)
    t.daemon = True
    t.start()
    self._thread = t

  def stop(self):
    if self._thread.is_alive():
      self._proc.terminate()
      self._thread.join(1)

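# __init__ already calls start(), so constructing the instance is enough;
# calling start() again would spawn a second process and thread.
producer = _Producer()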

I have written some code more or less like the above design, and now I want to be able to consume the output of producer_proc in other files by going:

# some_other_file.py
import file1
my_thing = file1.producer.thing 

Multiple other consumers might be grabbing a reference to file1.producer.thing, and they all need to read from the same producer_proc. The producer_proc should never be blocked. Is this a sane implementation? Does the Python GIL make it thread safe, or do I need to reimplement it using a Queue for getting data off the worker thread? Do consumers need to explicitly make a copy of the thing?

I guess I am trying to implement something like the Producer/Consumer pattern or the Observer pattern, but I'm not really clear on all the technical details of design patterns.

  • A single producer is constantly making things
  • Multiple consumers using things at arbitrary times
  • producer.thing should be replaced by a fresh thing as soon as the new one is available; most things will go unused, but that's OK
  • It's OK for multiple consumers to read the same thing, or to read the same thing twice in succession. They only want to be sure they have got the most recent thing when asked for it, not some stale old thing.
  • A consumer should be able to keep using a thing as long as it has it in scope, even though the producer may have already overwritten its self.thing with a fresh new thing.

Solution

  • Given your (unusual!) requirements, your implementation seems correct. In particular,

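    • If you're only updating one attribute, the Python GIL should be sufficient. In CPython, rebinding self.thing is a single bytecode instruction, and single bytecode instructions are atomic, so a consumer sees either the old object or the new one.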
    • If you do anything more complex, add locking! It's basically harmless anyway - if you cared about performance or multicore scalability, you probably wouldn't be using Python!
    • In particular, be aware that self.thing and self.n in this code are updated in separate bytecode instructions. The GIL could be released and reacquired between them, so you can't get a consistent view of the two unless you add locking. If you're not going to do that, I'd suggest removing self.n, as it's an "attractive nuisance" (easily misused), or at least adding a comment/docstring with this caveat.
    • Consumers don't need to make a copy. You're never mutating the particular object pointed to by self.thing (and couldn't with string or bytes objects; they're immutable), and Python is garbage-collected, so as long as a consumer has grabbed a reference to it, it can keep accessing it without worrying too much about what other threads are doing. The worst that could happen is your program using a lot of memory because several generations of self.thing are being kept alive.
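
If you do want self.thing and self.n to stay consistent, the locking could look roughly like the sketch below. This is only an illustration under assumptions: the names LatestValue, publish and snapshot are hypothetical (they are not in your code), and a plain loop stands in for reading chunks from the subprocess pipe. The lock is held only for two assignments, so the producer should barely be delayed.

```python
import threading


class LatestValue:
    """Hypothetical holder for the newest chunk and its sequence number."""

    def __init__(self, initial=b""):
        self._lock = threading.Lock()
        self._thing = initial
        self._n = 0

    def publish(self, thing):
        # Called by the single producer thread: update both fields together.
        with self._lock:
            self._thing = thing
            self._n += 1

    def snapshot(self):
        # Called by any consumer: returns an (n, thing) pair that match.
        with self._lock:
            return self._n, self._thing


latest = LatestValue()


def produce(count):
    # Stand-in for the loop that reads chunks from the subprocess pipe.
    for i in range(1, count + 1):
        latest.publish(bytes([i % 256]) * 4)


worker = threading.Thread(target=produce, args=(100,))
worker.start()

# A consumer may sample at any moment; n and thing always belong together.
n, thing = latest.snapshot()
worker.join()

# The producer has moved on, but the earlier reference is still usable.
final_n, final_thing = latest.snapshot()
```

Here thing stays valid after the producer has published newer chunks, because the consumer's reference keeps that bytes object alive; no copy is made anywhere.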

    I'm a bit curious where your requirements came from, in particular that you don't care whether a thing is never used or is used many times.