python, python-polars

read_sql in chunks with polars


I am trying to read a large database table with polars. Unfortunately, the data is too large to fit into memory and the code below eventually fails.

Is there a way in polars to define a chunk size and write these chunks to parquet, or to use the lazy dataframe interface to keep the memory footprint low?

import polars as pl
df = pl.read_sql("SELECT * from TABLENAME", connection_string) 
df.write_parquet("output.parquet")

Solution

  • Yes and no.

    There isn't a predefined method to do it, but you can certainly do it yourself. You'd do something like this:

    import os
    import polars as pl

    rows_at_a_time = 1000
    curindx = 0
    while True:
        # fetch the next chunk using MySQL's "LIMIT offset, count" syntax
        df = pl.read_sql(f"SELECT * from TABLENAME limit {curindx},{rows_at_a_time}", connection_string)
        if df.shape[0] == 0:  # no rows left, so we're done
            break
        df.write_parquet(f"output{curindx}.parquet")  # one parquet file per chunk
        curindx += rows_at_a_time

    # combine the chunk files into a single lazy dataframe
    ldf = pl.concat([pl.scan_parquet(x) for x in os.listdir(".") if x.startswith("output") and x.endswith(".parquet")])
    

    This uses MySQL's LIMIT offset, count syntax, so it assumes you're on MySQL or a database that accepts the same form, which isn't a trivial assumption. If you're not using MySQL, you may need to adjust the pagination clause, as sketched below.
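
    For instance, PostgreSQL and SQLite use LIMIT ... OFFSET, and SQL Server uses OFFSET ... FETCH (which requires an ORDER BY). A minimal sketch of swapping in those query strings, with TABLENAME and the id ordering column as placeholders:

    # PostgreSQL / SQLite style pagination
    query = f"SELECT * FROM TABLENAME LIMIT {rows_at_a_time} OFFSET {curindx}"
    # SQL Server style pagination (an ORDER BY clause is required)
    # query = f"SELECT * FROM TABLENAME ORDER BY id OFFSET {curindx} ROWS FETCH NEXT {rows_at_a_time} ROWS ONLY"
    df = pl.read_sql(query, connection_string)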

    Other than that, you just read your table in chunks, saving each chunk to a local file. When a query comes back with 0 rows, the loop stops and all the chunk files are loaded into a single lazy dataframe.
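
    As a side note, recent polars versions let pl.scan_parquet take a glob pattern, so the concat over os.listdir can be replaced with a one-liner (assuming the chunk files live in the working directory):

    ldf = pl.scan_parquet("output*.parquet")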

    You can almost certainly (and should) increase rows_at_a_time to something much larger than 1000, but the right value depends on your data and on how much memory your machine has.
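
    Once you have the lazy frame, you only materialize what you ask for. A small usage sketch; "some_column" is a placeholder, and sink_parquet is only available on newer polars releases:

    # run a query against the combined lazy frame; only the result is materialized
    result = ldf.filter(pl.col("some_column") > 0).collect()

    # on newer polars versions you can also stream everything back into a single
    # parquet file without collecting it into memory first
    ldf.sink_parquet("output.parquet")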