python, python-3.x, sqlite, amazon-s3, in-memory-database

Connect to a database in the cloud


I have an SQLite database (110 KB) in an S3 bucket. I want to connect to that database every time I run my Python application.

One option is to download the database every time I run the Python application and then connect to it. But I want to know if there is a way to connect to that SQLite database in memory, using S3FileSystem and open.

I'm using the sqlite3 library in Python 3.6.
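
For reference, the download-every-run fallback mentioned above can be sketched as follows. This is only an illustration, not the asker's code; since the Python 3.6 sqlite3 module cannot open a database directly from a bytes object, a temporary file is used as a bridge, and the boto3 bucket/key names are placeholders:

```python
import sqlite3
import tempfile


def connect_to_sqlite_bytes(db_bytes):
    # Write the raw database bytes to a temporary file and open it
    # read-only via a SQLite URI. Python 3.6's sqlite3 cannot open a
    # database directly from memory, so a temp file is the simplest bridge.
    tmp = tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False)
    tmp.write(db_bytes)
    tmp.flush()
    return sqlite3.connect("file:{}?mode=ro".format(tmp.name), uri=True)


# Hypothetical usage with boto3 (bucket and key are placeholders):
# import boto3
# obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="my.sqlite")
# conn = connect_to_sqlite_bytes(obj["Body"].read())
```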


Solution

  • As other answers indicate, you probably don't want to use SQLite as a primary database in the cloud.

    However, as part of a fun side project I wrote an Amazon Athena data source connector that allows you to query SQLite databases in S3 from Athena. In order to do that, I wrote a read-only SQLite interface for S3.

    SQLite has a concept of an OS interface, or VFS (virtual file system). Using a Python SQLite wrapper called APSW, you can write a VFS implementation for arbitrary filesystems. This is what I did in my project, and I've included the implementation below.

    In order to use this, you would first register the VFS and then create a new SQLite connection with this implementation as the driver.

    I should note this isn't optimized at all, so it will likely still read large portions of the database from S3 depending on your queries. But that doesn't sound like an issue in this specific case.

    S3FS = S3VFS()  # S3VFS defined below
    
    # This odd format is required by SQLite; S3_PREFIX, DATABASE_NAME and
    # S3_BUCKET are placeholders for your own values
    sqlite_uri = "file:/{}/{}.sqlite?bucket={}&immutable=1".format(
      S3_PREFIX,
      DATABASE_NAME,
      S3_BUCKET
    )
    
    connection = apsw.Connection(sqlite_uri,
      flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI,
      vfs=S3FS.vfsname
    )
    cursor = connection.cursor()
    

    Once you have the cursor, you can execute standard SQL statements like so:

    for x, y, z in cursor.execute("select x, y, z from foo"):
        print(cursor.getdescription())  # shows column names and declared types
        print(x, y, z)
    

    VFS Implementation (requires APSW library and boto3 for S3 connectivity)

    import apsw
    import boto3
    
    VFS_S3_CLIENT = boto3.client('s3')
    
    
    class S3VFS(apsw.VFS):
        def __init__(self, vfsname="s3", basevfs=""):
            self.vfsname = vfsname
            self.basevfs = basevfs
            apsw.VFS.__init__(self, self.vfsname, self.basevfs)

        def xOpen(self, name, flags):
            # "name" is an apsw.URIFilename carrying the URI query parameters
            return S3VFSFile(self.basevfs, name, flags)
    
    
    class S3VFSFile:
        def __init__(self, inheritfromvfsname, filename, flags):
            self.bucket = filename.uri_parameter("bucket")
            self.key = filename.filename().lstrip("/")
            print("Initiated S3 VFS for file: {}".format(self._get_s3_url()))
    
        def xRead(self, amount, offset):
            # HTTP Range headers are inclusive on both ends, so the last
            # byte requested is offset + amount - 1
            response = VFS_S3_CLIENT.get_object(
                Bucket=self.bucket,
                Key=self.key,
                Range='bytes={}-{}'.format(offset, offset + amount - 1)
            )
            return response['Body'].read()
    
        def xFileSize(self):
            response = VFS_S3_CLIENT.head_object(Bucket=self.bucket, Key=self.key)
            return response['ContentLength']
    
        def xClose(self):
            pass
    
        def xFileControl(self, op, ptr):
            return False
    
        def _get_s3_url(self):
            return "s3://{}/{}".format(self.bucket, self.key)
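
    Since xRead above issues one S3 request per read, repeated small reads of nearby pages are expensive. A simple block cache can cut round trips. The following is an untested sketch, not part of the original answer; `fetch` stands in for any range reader, such as a wrapper around the S3 Range GET used in xRead:

```python
class RangeCachingReader:
    """Cache fixed-size blocks so repeated nearby reads hit the
    backend only once. `fetch` is any callable(offset, length) -> bytes."""

    def __init__(self, fetch, block_size=4096):
        self.fetch = fetch
        self.block_size = block_size
        self.blocks = {}  # block index -> cached bytes

    def read(self, amount, offset):
        out = bytearray()
        end = offset + amount
        block = offset // self.block_size
        while block * self.block_size < end:
            if block not in self.blocks:
                # Fetch the whole block once; later reads reuse it
                self.blocks[block] = self.fetch(block * self.block_size,
                                                self.block_size)
            base = block * self.block_size
            data = self.blocks[block]
            start = max(offset, base) - base
            stop = min(end, base + self.block_size) - base
            out += data[start:stop]
            block += 1
        return bytes(out)
```

    An xRead implementation could then delegate to `read`, trading some memory for fewer S3 requests on query patterns that revisit the same database pages.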