Search code examples
exportparquetvertica

Vertica does not export to parquet all the data without limit clause


I see really weird behaviour when trying to do EXPORT TO PARQUET from Vertica to local drive. See example:

When I do

EXPORT TO PARQUET (directory = '/data') over (partition by install_date) 
    AS select field1, field2 from table where install_date >= 'some_date';

Vertica says:

Rows Exported
---------------
  16 405 136

But when I read result parquet (using pyarrow or Spark), i get only 2 522 845 rows. oO

Directory size is about 59M.

After few hours of testing I've found that if I do the same command but with limit clause with enormous large value, it works!

EXPORT TO PARQUET (directory = '/data') over (partition by install_date) 
   AS select field1, field2 from table where install_date >= 'some_date' limit 10000000000;

Then Vertica exports the same 16 405 136 number of rows, and parquet has the same number of rows! And directory size is around 350M.

Any ideas why?

Thanks.


Solution

  • Are you running on multi node Vertica cluster? If so, then each node exports its "part" (segment) of the data to a file on its own /data directory.

    You need to collect those files from all nodes or export to a shared location, e.g NFS / aws S3.

    With LIMIT x Vertica first collects the data to the node that executed the query, and then only this node exports the data into a file.

    This approach does not take advantage of Vertica's mpp ability, and should be much slower.