Search code examples
pythonduckdb

Load a saved on disk duckdb instance into a new in memory duckdb instance


I'm working on a project where, in a first stage, I pull some raw data, do a bunch of processing of it in duckdb, and end up with a bunch of tables that are used by a bunch of downstream components that also operate in duckdb.

I'd like to have the outputs from the first stage persist on disk, and not be modified by the downstream components, but the downstream components need to at least be able to create views and temporary tables. Further, there's no reason to have the downstream components operate on-disk... the data is small enough to fit in memory.

I'd like a magical solution like

conn = duckdb.connect(":memory:")
conn.load_from_disk(path_to_on_disk)

but nothing like that seems to exist. I can read each table from the on disk connection, convert to pandas, then load into the in memory connection, but that takes forever.

Any ideas?

Example of that inefficient approach:


def load_disk_duck_to_mem_duck(path: pathlib.Path) -> duckdb.DuckDBPyConnection:
    """Slow and ugly!"""
    source_db = duckdb.connect((path / "duck.db").as_posix())
    in_memory_db = duckdb.connect(":memory:")

    tables = source_db.execute("SHOW TABLES").fetchall()

    # Copy each table from on-disk to in-memory
    for table in tables:
        table_name = table[0]
        temp_df = source_db.table(table_name).df()
        # Load the table from on-disk and create a copy in the in-memory database
        in_memory_db.from_df(temp_df).create(table_name)
    return in_memory_db

Solution

  • I'd like to have the outputs from the first stage persist on disk

    This can be accomplished using a DuckDB EXPORT DATABASE statement.

    ... the downstream components need to at least be able to create views and temporary tables. Further, there's no reason to have the downstream components operate on-disk... the data is small enough to fit in memory.

    So each downstream component should read the exported database, e.g. using the IMPORT DATABASE statement.


    Alternatively, if you want a single file, you can use ATTACH and COPY DATABASE, like so:

    attach 'test.db' as test;
    copy from database memory to test;
    

    This assumes the specified file ('test.db' in the example) does not already exist.