Tags: python, pandas, parquet, pyarrow, duckdb

Importing a parquet file in chunks and inserting into DuckDB


I am trying to load a parquet file with row group size = 10 into a DuckDB table in chunks. I am not finding any documentation that covers this.

This is my work so far:

import duckdb
import pandas as pd
import gc
import numpy as np

# connect to a persistent database stored in the file database.duckdb
con = duckdb.connect(database='database.duckdb', read_only=False)

df1 = pd.read_parquet("file1.parquet")
df2 = pd.read_parquet("file2.parquet")

# create the table "my_table" from the DataFrame "df1"
con.execute("CREATE TABLE table1 AS SELECT * FROM df1")

# create the table "my_table" from the DataFrame "df2"
con.execute("CREATE TABLE table2 AS SELECT * FROM df2")

con.close()
gc.collect()

Please help me load both tables from the parquet files by row group or in chunks. Also, how can I load the data into DuckDB in chunks?


Solution

  • df1 = pd.read_parquet("file1.parquet")
    

    This statement will read the entire parquet file into memory. Instead, I assume you want to read it in chunks (i.e. one row group after another, or in batches) and then write each chunk into DuckDB; see the sketch at the end of this answer for the DuckDB part.

    This is not possible with pandas at the moment. You can use something like pyarrow (or fastparquet) to do it. Here is an example from the pyarrow docs.

    iter_batches can be used to read streaming batches from a Parquet file. This can be used to read in batches, read certain row groups or even certain columns.

    import pyarrow.parquet as pq
    parquet_file = pq.ParquetFile('example.parquet')
    # stream the file 10 rows at a time instead of loading it all at once
    for batch in parquet_file.iter_batches(batch_size=10):
        print("RecordBatch")
        print(batch.to_pandas())
    

    The above example simply reads 10 records at a time. You can further restrict the read to certain row groups or even certain columns, as shown below.

    for batch in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
        print(batch.to_pandas())
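
    To tie this back to your original code, here is a rough sketch (not benchmarked, using the table and file names from your question) that reads file1.parquet batch by batch and loads each batch into DuckDB. It creates the table from the first batch and appends the rest, relying on the DuckDB Python client's ability to query a local pandas DataFrame (batch_df below) by name.

    import duckdb
    import pyarrow.parquet as pq

    # persistent database, as in your original code
    con = duckdb.connect(database='database.duckdb', read_only=False)

    parquet_file = pq.ParquetFile('file1.parquet')
    table_created = False
    for batch in parquet_file.iter_batches(batch_size=10):
        # only one small batch is held in memory at a time
        batch_df = batch.to_pandas()
        if not table_created:
            # create table1 with the schema and rows of the first batch
            con.execute("CREATE TABLE table1 AS SELECT * FROM batch_df")
            table_created = True
        else:
            # append each subsequent batch
            con.execute("INSERT INTO table1 SELECT * FROM batch_df")

    con.close()

    Repeat the same loop for file2.parquet and table2, or wrap it in a small helper that takes the file name and table name as parameters.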
    

    Hope this helps!