I see really weird behaviour when trying to do EXPORT TO PARQUET from Vertica to a local drive. Here is an example:
When I do
EXPORT TO PARQUET (directory = '/data') OVER (PARTITION BY install_date)
AS SELECT field1, field2 FROM table WHERE install_date >= 'some_date';
Vertica says:
Rows Exported
---------------
16 405 136
But when I read the resulting Parquet files (using pyarrow or Spark), I get only 2 522 845 rows.
The directory size is about 59 MB.
After a few hours of testing, I found that if I run the same command with a LIMIT clause set to an enormously large value, it works!
EXPORT TO PARQUET (directory = '/data') OVER (PARTITION BY install_date)
AS SELECT field1, field2 FROM table WHERE install_date >= 'some_date' LIMIT 10000000000;
Then Vertica exports the same 16 405 136 rows, and the Parquet files contain the same number of rows! The directory size is around 350 MB.
Any ideas why?
Thanks.
Are you running on a multi-node Vertica cluster? If so, each node exports its "part" (segment) of the data to files in its own local /data directory.
You need to collect those files from all nodes, or export to a shared location, e.g. NFS or AWS S3.
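As a minimal sketch: you can confirm the cluster size with the v_catalog.nodes system table, then point the export at a path every node can write to. The shared mount point /mnt/shared/data and the S3 bucket name below are assumptions, and note that the partition column must also appear in the SELECT list for a partitioned export:

-- how many nodes (and therefore local output directories) are involved?
SELECT node_name, node_state FROM v_catalog.nodes;

-- export to a mount point shared across all nodes (assumed path):
EXPORT TO PARQUET (directory = '/mnt/shared/data') OVER (PARTITION BY install_date)
AS SELECT install_date, field1, field2 FROM table WHERE install_date >= 'some_date';

-- or export straight to S3 (assumed bucket; AWS credentials/region must be configured on the cluster):
EXPORT TO PARQUET (directory = 's3://my-bucket/data') OVER (PARTITION BY install_date)
AS SELECT install_date, field1, field2 FROM table WHERE install_date >= 'some_date';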
With LIMIT x, Vertica first collects all of the data on the initiator node (the node that executed the query), and only that node then exports the data to files.
This approach does not take advantage of Vertica's MPP architecture, so it should be much slower.
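If you want to double-check the result without leaving Vertica, you can read the exported files back through an external table and count the rows. A sketch under assumptions: the shared directory from above, hypothetical types for field1/field2, and Vertica's hive_partition_cols option to recover the partition column from the directory names (partition columns go last in the column list):

-- assumed column types; adjust to match the source table
CREATE EXTERNAL TABLE export_check (field1 INT, field2 VARCHAR, install_date DATE)
AS COPY FROM '/mnt/shared/data/*/*.parquet' PARQUET(hive_partition_cols='install_date');

-- should match the 16 405 136 rows Vertica reported on export
SELECT COUNT(*) FROM export_check;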