Search code examples
pythonpostgresqlscrapytwisted

Nonblocking PostgreSQL ORM


I have a web scraper using Scrapy, which is built on Twisted using Deferreds. For each page scraped I want to insert to several different tables in a PostgreSQL database. I would like this insertion to be nonblocking. Is there a way to get nonblocking Postgres interaction with an ORM like SQLAlchemy?

I understand there is nonblocking Postgres interaction with alchimia or txpostgres, but neither offers ORM functionality. If the answer is it's not possible, I'll use one of those. Is one of them preferred over the other?


Solution

  • The way an ORM is built in Python is generally not amenable to asynchronous operation.

    A primary feature of Python ORMs that seems to be attractive is apparently normal attribute access to get and set values. However, both of these present problems.

    For attribute access to operate asynchronously, the obvious behavior is for something like database_object.some_field to evaluate to a Deferred. Then the application would wait on this to receive the actual value. The code ends up looking something like:

    d = database_object.some_field
    def got_some_field(result):
        print("DB obj some_field = {}".format(result))
    d.addCallback(got_some_field)
    

    Instead of a simple expression, you've got four lines of code. One could argue that this has entirely destroyed the value the ORM is meant to provide. It may be possible to work around this problem by pre-loading values from the database. However, it's not always desirable to pre-load everything (it can harm performance) and it's not always obvious it's correct to do so (if you start a transaction, pre-load some values, execute an operation that might change those values, then retrieve them from the Python object, what should happen?).

    Updating values is even worse. Whereas a synchronous ORM might let you:

    database_object.some_field = 3
    

    How would an asynchronous ORM work for something similar? Assignment can't produce a value at all. So you need another inconvenient pattern to be able to do updates correctly. Something like:

    d = database_object.update("some_field", 3)
    def updated(ignored):
        # Continue ...
    d.addCallback(updated)
    

    Or a very complicated system to do this behind the scenes in a way that doesn't allow for inconsistency between values in the database and values in memory in Python. It's probably not impossible but the degree of complexity involved may be why no one seems to have tackled the problem.