Tags: python, export, parquet, vertica, pyarrow

Inconsistent schema when reading Parquet files exported from Vertica


I've noticed odd behaviour when exporting data from Vertica and later reading it with pyarrow (Python). Say I want to dump a table to Parquet, partitioned by event_date:

EXPORT TO PARQUET (directory = '/data/table_name') over (partition by event_date) 
AS select * from table;

This produces the following directory structure:

/data/table_name
 - event_date=2019-01-01
 - event_date=2019-01-02
 - event_date=2019-01-03
...

Then I try to read it with pyarrow:

import pyarrow.parquet as pq
df = pq.read_table('/data/table_name')

But I get an inconsistent-schema error:

ValueError: Schema in partition[event_date=0] ./event_date=2019-01-01/84087de6-node0001-139759025940222.parquet was different.
user_id: string
event_id: int64
event_name: string
install_date: int32
event_date: int32
site_id: string

vs

user_id: string
event_id: int64
event_name: string
install_date: int32
site_id: string

How come?

P.S. If I read each directory separately, it works fine:

df1 = pq.read_table('/data/table_name/event_date=2019-01-01')
df2 = pq.read_table('/data/table_name/event_date=2019-01-02')
df3 = pq.read_table('/data/table_name/event_date=2019-01-03')

df1.schema == df2.schema == df3.schema
> True
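
For reference, the two schemas in the error message can be compared field by field. The snippet below is only a sketch using plain dictionaries typed in from the error output; the helper name schema_diff is made up for illustration and is not part of pyarrow.

```python
# Field lists copied by hand from the error message above.
with_partition = {
    "user_id": "string",
    "event_id": "int64",
    "event_name": "string",
    "install_date": "int32",
    "event_date": "int32",
    "site_id": "string",
}
without_partition = {
    "user_id": "string",
    "event_id": "int64",
    "event_name": "string",
    "install_date": "int32",
    "site_id": "string",
}


def schema_diff(left, right):
    """Return (fields only in left, fields only in right, fields whose types differ)."""
    only_left = {name: left[name] for name in left if name not in right}
    only_right = {name: right[name] for name in right if name not in left}
    changed = {
        name: (left[name], right[name])
        for name in left
        if name in right and left[name] != right[name]
    }
    return only_left, only_right, changed


print(schema_diff(with_partition, without_partition))
```

The only field that differs is event_date, which is also the partition column.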

Solution

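  • Exclude the partition column (event_date) from the SELECT list of the export query. Its values are already encoded in the directory names (event_date=2019-01-01, ...), and pyarrow rebuilds the column from them when it reads the dataset: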

    EXPORT TO PARQUET (directory = '/data/table_name') over (partition by event_date) 
    AS SELECT user_id,
              event_id,
              event_name,
              install_date,
              site_id
    FROM table;
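
To illustrate why the column does not need to be stored inside the files: in a key=value directory layout like the one above, the partition value can be recovered from the path alone. The following is a minimal standard-library sketch of that idea, not pyarrow's actual implementation; the helper name partition_values is hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(file_path, root):
    """Collect key=value directory segments between root and the file."""
    relative = PurePosixPath(file_path).relative_to(root)
    values = {}
    # Every parent directory of the form key=value contributes one column.
    for segment in relative.parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep:
            values[key] = value
    return values


# File name taken from the error message in the question.
path = "/data/table_name/event_date=2019-01-01/84087de6-node0001-139759025940222.parquet"
print(partition_values(path, "/data/table_name"))
```

Since the reader can derive event_date this way, a copy of the same column inside the data files is redundant and, as the error in the question shows, can conflict with the schema the reader expects.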