
DuckDB - copying a large number of CSVs to a table gets increasingly slow


I am moving over from SQLite to DuckDB. I am trying to add about 2 TB of data (~5 billion rows), stored in 10,000 CSVs, to a DuckDB table. The format, structure, size, etc. of each CSV is identical.

The first CSV copied in 2 seconds; halfway through, each CSV is now taking 15 seconds, and it keeps getting slower. Presumably I am doing something that causes more work than simply appending to the end of the table.

I am using Python on a Windows machine. I also found that Python was not releasing RAM unless I disconnected from and reconnected to the database. I never had this issue with SQLite.

Below is my code. Thanks.

import time
import gc
import duckdb

# csv_files, total_files, batch_size and raw_duckdb_path are defined earlier
con = duckdb.connect(raw_duckdb_path)

# Process each CSV file
for i, file in enumerate(csv_files, start=1):
    file_start_time = time.time()

    # Print progress message
    print(f"Processing file {i}/{total_files}: {file}")

    # Copy CSV file into the table
    query = f"COPY search_data FROM '{file}'"
    con.execute(query)

    # Cleanup
    if i % batch_size == 0:
        con.close()  # Disconnect from DuckDB
        gc.collect()  # Force memory cleanup
        con = duckdb.connect(raw_duckdb_path)  # Reconnect


Solution

  • I'm posting what I found and my workaround, having played around with this for a couple of days, in case anyone else comes across this issue in the future.

    The main solution I found was to combine a large number of CSVs and then write the result to the table in one go. This is in line with the known per-statement overheads when inserting into DuckDB.
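
    As an aside, DuckDB's CSV reader can also ingest many files in a single statement, either via a glob pattern or a list of paths, which would avoid the manual combine-and-rewrite step entirely. This isn't benchmarked here; a minimal sketch, reusing the search_data table and csv_files list from the question:

    import duckdb

    # Hand DuckDB a whole batch of CSV paths in one statement; read_csv_auto
    # also accepts a glob such as 'folder/*.csv'.
    con = duckdb.connect(raw_duckdb_path)

    batch = csv_files[:100]  # e.g. 100 files per statement
    file_list = ", ".join(f"'{f}'" for f in batch)
    con.execute(f"INSERT INTO search_data SELECT * FROM read_csv_auto([{file_list}])")

    con.close()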

    I ran speed tests where I read a batch of my 200 MB CSVs into a list, combined them, wrote them back to disk as one large combined CSV, and then imported that into my DuckDB table. I tested speeds using both INSERT INTO and COPY, via Python.

    I tested batch sizes of 1, 5, 25, 50, 100, 250 and 500 CSVs. Note that the batch size of 500 produced combined CSVs of ~100 GB. For reference, I'm running a workstation with 512 GB of RAM and 24 CPUs, and all I/O was on M.2 NVMe SSDs. The CSVs have 44 columns of varying types (string, integer, date, double, etc.) and each has approximately 600-700k rows. Speeds reflect the entire operation (connecting to the database, reading and combining the CSVs, INSERTing/COPYing, and disconnecting). Below is the code snippet.

    import time
    import duckdb
    import polars as pl

    # batch_sizes, csv_files, total_files, output_folder, raw_duckdb_path and
    # results are defined earlier
    for batch_size in batch_sizes:
        print(f"Processing batch size: {batch_size}")

        # Start timing
        start_time = time.time()

        # Connect to DuckDB
        con = duckdb.connect(raw_duckdb_path)

        # Read CSVs into a list of DataFrames (up to batch_size files)
        df_list = [pl.read_csv(csv) for csv in csv_files[:min(batch_size, total_files)]]

        # Combine all DataFrames into one
        combined_df = pl.concat(df_list, how="diagonal")

        # Define output CSV path
        output_path = f"{output_folder}/combined_{batch_size}.csv"

        # Write the combined data to CSV
        combined_df.write_csv(output_path)

        # Insert the combined CSV into DuckDB
        # sql = f"INSERT INTO search_buying SELECT * FROM read_csv_auto('{output_path}');"
        sql = f"COPY search_buying FROM '{output_path}' (AUTO_DETECT TRUE);"
        con.execute(sql)

        # Disconnect from DuckDB
        con.close()

        # End timing
        elapsed_time = time.time() - start_time

        # Compute implied time per CSV
        time_per_csv = elapsed_time / batch_size

        # Store results
        results.append({"batch_size": batch_size, "time_taken": elapsed_time, "time_per_csv": time_per_csv})
    

    Results are plotted below as well as in table form.

    [Plot and table of total time and implied time per CSV for each batch size, for both INSERT and COPY]

    Performance was best with batches of 100 CSVs and using INSERT. Batches of 100 CSVs produce combined files of around 20 GB, which I suspect is also more manageable for most machines in terms of RAM. I suspect the slowdown at the larger batch sizes actually comes from the longer write times of the combined CSV rather than from the INSERT/COPY itself, but I haven't tested that. These runs were done on a table that already had around 500 GB of CSV data imported into it, to make sure they wouldn't exhibit the same kind of slowdown I described earlier as the table gets bigger. While ingestion may still slow down somewhat with a bigger table due to some overhead, that cost is now spread across far more CSVs per operation.
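
    An untested variant that would skip the intermediate CSV write entirely: DuckDB's Python client can scan an in-memory Polars DataFrame referenced by name in the SQL, so the combined frame could be inserted directly. A rough sketch along the lines of the benchmark code above:

    import duckdb
    import polars as pl

    con = duckdb.connect(raw_duckdb_path)

    # Combine a batch of CSVs in memory, as above, but insert the resulting
    # DataFrame directly instead of writing it back to disk first.
    df_list = [pl.read_csv(csv) for csv in csv_files[:100]]
    combined_df = pl.concat(df_list, how="diagonal")

    con.execute("INSERT INTO search_buying SELECT * FROM combined_df")
    con.close()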

    I tried something similar in R using only INSERT, and the times were awful. Inserting 1 CSV took 4 minutes, and even at a batch of 100 CSVs it was taking around 30 seconds per CSV. I also found that disconnecting in R took ages, and there seemed to be significant memory-leak issues. I thought it might be a problem with my duckdb package install or version of R, but after a fresh install of R, Rtools and the libraries, the issues were unchanged.