I am trying to read 30 CSV files, around 12 GB uncompressed and 30 million rows, into a single Parquet file using DuckDB. I upgraded today to the latest version, 0.10.0.
Each of my CSV files is standardised, so I am able to define the column names and types upfront, and I have done that as part of my query:
COPY (SELECT * FROM read_csv('*.csv',
    delim = ',',
    header = false,
    parallel = true,
    auto_detect = false,
    columns = {
        'UPRN': 'BIGINT',
        'UDPRN': 'INT',
        'CHANGE_TYPE': 'VARCHAR(1)',
        -- 70 more attributes
    }))
TO 'output.parquet' (FORMAT 'parquet', COMPRESSION 'zstd', ROW_GROUP_SIZE 100000);
Each time I run this I experience an Out of Memory error.
So I have since tried loading the CSV data into a persistent local database and then doing a COPY export:
COPY table TO 'output.parquet' (FORMAT 'parquet', COMPRESSION 'zstd', ROW_GROUP_SIZE 100000)
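For completeness, the load step itself was roughly along these lines (the database file name and table name here are just placeholders, and the column list is the same definition as in the query above):
-- CLI started against a database file, e.g. duckdb staging.duckdb (file name is illustrative)
CREATE TABLE addresses AS   -- table name is illustrative
SELECT * FROM read_csv('*.csv',
    delim = ',',
    header = false,
    auto_detect = false,
    columns = {
        'UPRN': 'BIGINT',
        'UDPRN': 'INT',
        'CHANGE_TYPE': 'VARCHAR(1)'
        -- ...remaining column definitions as above
    });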
This gives the same out-of-memory error message:
Error: Out of Memory Error: failed to allocate data of size 4.2 MiB (13.9 GiB/13.9 GiB used)
I read the recent release notes: there is a new Temporary Memory Manager, and I can set a memory limit using:
PRAGMA memory_limit = '10GB';
This definitely seems to make the loading more stable, but it has no impact on the Parquet export.
I am on a Windows laptop with 32 GB of RAM, using the latest DuckDB CLI.
I am sure there are other Python-based approaches to doing this, but I would prefer to stick with DuckDB, as I am already using it in another process and it is a great tool!
Try turning the preserve_insertion_order option off:
SET preserve_insertion_order = false;
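For example, running the export in the same session could then look roughly like this (the table name and the memory limit are placeholders taken from the question, so adjust them to your setup):
SET preserve_insertion_order = false;
SET memory_limit = '10GB';  -- optional cap, as already tried in the question
COPY my_table TO 'output.parquet'
    (FORMAT 'parquet', COMPRESSION 'zstd', ROW_GROUP_SIZE 100000);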
Moreover, a new pull request will help reduce the memory consumption of Parquet exports. This is not yet merged at the time of writing this post, but likely will be part of the upcoming DuckDB release in a few weeks.