Tags: csv, hive, presto, trino

Export non-varchar data to CSV table using Trino (formerly PrestoSQL)


I am working on some benchmarks and need to compare the ORC, Parquet and CSV formats. I have exported TPC-H (SF1000) to ORC-based tables. To export it to Parquet, I can run:

CREATE TABLE hive.tpch_sf1_parquet.region
  WITH (format = 'parquet')
  AS SELECT * FROM hive.tpch_sf1_orc.region

When I try the same approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other data types (e.g. bigint) to text and store the column type in the Hive metadata.

I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then the data is emitted to a file on the client. Although this works for SF1, it quickly becomes unusable at the SF1000 scale factor. Another disadvantage is that my Hive metastore wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
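For reference, the per-table CLI export could be scripted roughly like this (a sketch only, assuming the same server, catalog and schema as above and the standard TPC-H table names; it still suffers from the client-side file problem described here):

```shell
# Sketch: export each TPC-H table to its own CSV file via the Trino CLI.
# Assumes the trino CLI is on PATH and the server/catalog/schema above exist.
for t in nation region customer orders lineitem part partsupp supplier; do
  trino --server trino:8080 --catalog hive --schema tpch_sf1_orc \
        --output-format CSV --execute "SELECT * FROM ${t}" > "${t}.csv"
done
```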

Does anyone have an idea how to convert my ORC/Parquet data to CSV using Hive?


Solution

  • In the Trino Hive connector, CSV tables can contain only varchar columns.

    You need to cast the exported columns to varchar when creating the table:

    CREATE TABLE region_csv
    WITH (format = 'CSV')
    AS SELECT
      CAST(regionkey AS varchar) AS regionkey,
      CAST(name AS varchar) AS name,
      CAST(comment AS varchar) AS comment
    FROM region_orc
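    For wide tables such as lineitem, writing the cast list by hand is tedious. A sketch of generating the SELECT list from information_schema (assuming a Trino version with listagg, and the catalog/schema names from the question):

    ```sql
    -- Sketch: generate the "CAST(... AS varchar) AS ..." list for a wide table.
    SELECT listagg('CAST(' || column_name || ' AS varchar) AS ' || column_name, ', ')
             WITHIN GROUP (ORDER BY ordinal_position)
    FROM hive.information_schema.columns
    WHERE table_schema = 'tpch_sf1_orc'
      AND table_name = 'lineitem';
    ```

    The result can then be pasted into the CREATE TABLE ... AS SELECT statement.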
    

    Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
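    For example, since Trino does not implicitly compare varchar with bigint, a predicate on a numeric key must cast the CSV column back before comparing (a sketch against the region_csv table created above):

    ```sql
    -- Sketch: reverse cast in a benchmark query against the CSV table.
    SELECT name
    FROM region_csv
    WHERE CAST(regionkey AS bigint) = 1;
    ```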