python, synchronization, locking

How can I lock files on AWS S3?


By locking, I don't mean the Object Lock S3 makes available. I'm talking about the following situation:

I have multiple (Python) processes that read and write to a single file hosted on S3; maybe the file is an index of sorts that needs to be updated periodically.

The processes run in parallel, so I want to make sure only a single process can write to the file at any given time (to avoid concurrent writes clobbering each other's data).

If I were writing this to a shared filesystem, I could just use flock to synchronize access to the file, but I can't do that on S3 afaict.
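
For reference, a minimal sketch of the flock pattern on a local or shared filesystem, assuming a Unix system. The helper name `locked_file` and the `index.txt` path are hypothetical; the point is only to show the behavior wanted for S3:

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def locked_file(path, mode="a+"):
    """Open `path` and hold an exclusive flock on it for the duration of the block."""
    with open(path, mode) as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        try:
            yield f
        finally:
            f.flush()  # push buffered writes to the OS before releasing the lock
            fcntl.flock(f, fcntl.LOCK_UN)


# Hypothetical index file, used only for illustration.
index_path = os.path.join(tempfile.gettempdir(), "index.txt")
with locked_file(index_path) as f:
    f.write("new entry\n")
```

flock is advisory, so this only works if every writer goes through the same helper.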

What is the easiest way to lock files on AWS S3?


Solution

  • Unfortunately, AWS S3 does not offer a native way of locking objects - there's no flock analogue, as you pointed out. Instead you have a few options:

    Use a database

    For example, Postgres offers advisory locks. When setting this up, you will need to do the following:

    1. Make sure all processes can access the database.
    2. Make sure the database can handle the incoming connections (if you're running some type of large processing grid, you may want to put your Postgres instance behind PgBouncer).
    3. Be careful that you do not close the session from the client before you're done with the lock.

    There are a few other caveats you need to consider when using advisory locks - from the Postgres documentation:

    Both advisory locks and regular locks are stored in a shared memory pool whose size is defined by the configuration variables max_locks_per_transaction and max_connections. Care must be taken not to exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands depending on how the server is configured.

    In certain cases using advisory locking methods, especially in queries involving explicit ordering and LIMIT clauses, care must be taken to control the locks acquired because of the order in which SQL expressions are evaluated.
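
    As a rough sketch of the advisory-lock approach (not a definitive implementation): the helper names `lock_key` and `advisory_lock` are made up for illustration, and `conn` is assumed to be a DB-API connection to Postgres, for example one returned by `psycopg2.connect(...)`. `pg_advisory_lock` takes a 64-bit integer key, so the lock name is hashed down to one. The lock is session-level, so it is held until it is explicitly released or the connection closes.

```python
import hashlib
import struct
from contextlib import contextmanager


def lock_key(name):
    """Map a lock name to a signed 64-bit integer usable as an advisory lock key."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


@contextmanager
def advisory_lock(conn, name):
    """Hold a session-level Postgres advisory lock for the duration of the block.

    `conn` is assumed to be a DB-API connection (e.g. psycopg2) used only for locking.
    """
    key = lock_key(name)
    cur = conn.cursor()
    cur.execute("SELECT pg_advisory_lock(%s)", (key,))  # blocks until the lock is granted
    try:
        yield
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (key,))
        cur.close()


# Usage (connection details and bucket name are placeholders):
# conn = psycopg2.connect(...)
# with advisory_lock(conn, "s3://my-bucket/index.json"):
#     ...  # download, modify, and re-upload the S3 object here
```

    Using a dedicated connection for the lock (ideally in autocommit mode) keeps a failed transaction elsewhere from getting in the way of the unlock call. Also note that a connection pooler running in transaction-pooling mode generally does not preserve session state, so session-level locks like this one need a direct or session-pooled connection.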

    Use an external service

    I've seen people use something like lockable to solve this issue. From their docs, they seem to have a Python library:

    $ pip install lockable-dev

    from lockable import Lock
    
    with Lock('my-lock-name'):
        # do stuff while holding the lock
        ...
    

    If you're not using Python, you can still use their service by hitting some HTTP endpoints:

    curl https://api.lockable.dev/v1/acquire/my-lock-name
    
    curl https://api.lockable.dev/v1/release/my-lock-name