Search code examples
pandasdataframestreamlitduckdb

Faster way to convert duckdb results to dataframe that is supported by streamlit?


Not sure if I am using DuckDB properly in my streamlit application. The whole point of using duckdb in my streamlit was to make it speedy and responsive enough.

import duckdb
import pandas as pd
import polars as pl
import time

conn = duckdb.connect()

# dummy.csv refers to a file that I created with 100 million rows for testing. 
3 gb dataset.

query = \
"""
CREATE TABLE dummy as (
select * from 'dummy.csv'
)
"""

# Creates table
conn.execute(query)

# Part 1: Testing with Duck DB show()
start_time = time.time()


query = \
"""
select * 
from dummy
"""

df = conn.sql(query).show()
df 
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.013511896133422852 seconds --- duckdb doing its magic. This is the kind of speed i want.

start_time = time.time()

# Query 1
query = \
"""
select * 
from dummy
"""

# The reason i am doing this is to display the table in my streamlit application (This is just an illustration of how I convert duckdb results into dataframe which is the fed to streamlit application

df = conn.sql(query).df()
df
print("--- %s seconds ---" % (time.time() - start_time))

--- 6.356271505355835 seconds --- Too slow because I am converting to pandas dataframe i believe.

start_time = time.time()

# Query 1
query = \
"""
select * 
from dummy
"""

df = conn.execute(query).pl()
display(df)
print("--- %s seconds ---" % (time.time() - start_time))

--- 1.8795912265777588 seconds ---

Okay this is better but not happy.

Is there any faster way to convert data into dataframe that is supported by streamlit?

st.dataframe(df)

Solution

  • From the Python API docs:

    The result of the query is returned as a Relation. A relation is a symbolic representation of the query.

    The query is not executed until the result is fetched or requested to be printed to the screen.

    .show() just prints a summary of the Relation, it doesn't actually return anything.

    >>> foo = duckdb.sql("SELECT 1").show()
    ┌───────┐
    │   1   │
    │ int32 │
    ├───────┤
    │     1 │
    └───────┘
    
    >>> type(foo)
    NoneType
    

    As stated, the query is not executed.

    If we use a larger query, it is easier to see that we are just getting a "preview":

    >>> duckdb.sql("from 'large-file.parquet'").show()
    ┌──────────────┬─────────────────┬───┬─────────────┬────────┬────────────┐
    │ project_name │ project_version │ … │ skip_reason │ lines  │ repository │
    │   varchar    │     varchar     │   │   varchar   │ uint64 │   uint32   │
    ├──────────────┼─────────────────┼───┼─────────────┼────────┼────────────┤
    │ zz-pix       │ 0.0.7           │ … │             │      1 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │      2 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │      5 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │     17 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │    184 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │    201 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │     27 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │     45 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │     22 │        265 │
    │ zz-pix       │ 0.0.7           │ … │             │      7 │        265 │
    │  ·           │   ·             │ · │      ·      │      · │         ·  │
    │  ·           │   ·             │ · │      ·      │      · │         ·  │
    │  ·           │   ·             │ · │      ·      │      · │         ·  │
    │ zrb          │ 0.11.0          │ … │             │      6 │        265 │
    │ zrb          │ 0.11.0          │ … │             │     23 │        265 │
    │ zrb          │ 0.11.0          │ … │             │    173 │        265 │
    │ zrb          │ 0.11.0          │ … │             │    134 │        265 │
    │ zrb          │ 0.11.0          │ … │             │     44 │        265 │
    │ zrb          │ 0.11.0          │ … │             │     37 │        265 │
    │ zrb          │ 0.11.0          │ … │             │      6 │        265 │
    │ zrb          │ 0.11.0          │ … │             │      2 │        265 │
    │ zrb          │ 0.11.0          │ … │             │      2 │        265 │
    │ zrb          │ 0.11.0          │ … │             │      2 │        265 │
    ├──────────────┴─────────────────┴───┴─────────────┴────────┴────────────┤
    │ ? rows (>9999 rows, 20 shown)                     11 columns (5 shown) │
    └────────────────────────────────────────────────────────────────────────┘
      ^^^^^^
    

    The ? rows indicates that the query has not been executed yet as the total number of rows is unknown.

    Using .pl() you are materializing the result set which will execute the query.

    In Python terms, you could compare it to declaring/consuming a generator:

    results = (row for row in ("foo", "bar", "baz")) # returns immediately, doesn't compute
    
    list(results) # compute
    # ['foo', 'bar', 'baz']
    

    See also