Search code examples
parquetvertica

Vertica - What is the best practice for exporting to Parquet


I have a Vertica DB that suppose to be used for my "Hot" data. I figure out that i can save my "Cold" data as parquet files on HDFS as an external table, i also succeed to export my data to Parquet file. I have a few questions regarding this process: * exporting to parquet - how to do it often ? do i need to run the queries as cron tab?
* when I create an external table does Vertica knows to "connect" it to my main table? I mean when running any query Verica will search also in external tables related to the main table where the "hot" data stored.


Solution

  • If you want to do periodic exports you'll need to use an external scheduler (like cron). You can export to HDFS, NFS, or S3.

    If you have some data (your hot data) in Vertica and other (cold) data in external tables, those are two separate tables. You can query them together (select... from hot,cold), but your external Parquet data isn't "in" the table that holds your Vertica (ROS, hot) data.

    When you query an external table, Vertica reads the relevant data from the external location. In a sense, it's doing a lightweight load using the FROM clause of the CREATE EXTERNAL TABLE expression on each query. (The Parquet columnar format and optimizations like predicate pushdown mean that's not painful.) This means that if your FROM clause is a glob (like hdfs:///data/*/*.parquet), if you export more data you don't need to update the table definition -- it just works.

    Note that each export needs to be into a new directory; if you put them all under the same parent directory, then you can use a glob as in my example. See the documentation.