Search code examples
duckdb

Writing .parquet from duckdb prefixes column names with "PARGO_PREFIX_"


DuckDB is changing my column names as I write out to .parquet file, and I can't figure out why.

In a DuckDB memory only instance (on Ubuntu 23.04) I run:

create table mytable (_id int, str varchar, num int);
insert into mytable (_id, str, num) values (1, 'one', 1);
insert into mytable (_id, str, num) values (2, 'two', 2);
copy (select * from mytable) to 'mytable_splat.parquet' (format 'parquet');
copy (select _id, str, num from mytable) to 'mytable_colnames.parquet' (format 'parquet');
copy (select _id as "id", str, num from mytable) to 'mytable_renamecols.parquet' (format 'parquet');
copy (select * from mytable) to 'mytable.csv' (format csv, header true);

I then use parquet-tools to show the contents of the .parquet files:

$ parquet-tools cat mytable_splat.parquet -f csv
PARGO_PREFIX__id,Str,Num
1,one,1
1,two,2
$ parquet-tools cat mytable_colnames.parquet -f csv
PARGO_PREFIX__id,Str,Num
1,one,1
1,two,2
$ parquet-tools cat mytable_renamecols.parquet -f csv
Id,Str,Num
1,one,1
1,two,2
$ cat mytable.csv
_id,str,num
1,one,1
1,two,2

There's two issues here:

  • The Parquet output (but not CSV output) capitalizes the first letter of each column that doesn't start with an underscore _
  • The Parquet output (but not CSV output) renames columns starting with an underscore, to prefix PARGO_PREFIX_ on to the column. _id becomes PARGO_PREFIX__id.

I'm using DuckDB to write data from Node.js to .parquet files as suggested by this post: NodeJS Parquet write This seems like a more robust solution than the several parquetjs based NPM libraries, but I can't understand or control why these columns are being renamed in .parquet files only. The root issue is in DuckDB itself, not the NodeJS code. Any ideas?


Solution

  • @Mause answered correctly that DuckDB isn't creating the .parquet file with these incorrect column names, parquet-tools was changing the names while printing the contents.

    The easiest way for me to validate this was to add a Parquet file inspecter into VS Code, I used parquet-viewer *(I'm sure there's others that would work as well).

    With this extension, I could see that the contents of each of the files that I created in my example above are:

    {"_id":1,"str":"one","num":1}
    {"_id":1,"str":"two","num":2}
    

    @Mause, if you'd like to post your own answer I'd be happy to accept it. I'm just posting this one to provide some closure to the question.