apache-spark, hadoop, hdfs, orc

How to alter an ORC file's embedded schema?


Is there a lightweight way to change the data type of a specific column in an ORC file without converting the whole column and rewriting the entire ORC file?

The following is the heavyweight solution:

  1. Read the ORC file in Spark
  2. Convert the data type of the specific column
  3. Write the converted ORC file to HDFS
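For reference, the three steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not a tuned job: the function name `rewrite_orc_with_cast` and the paths and column name in the usage note are hypothetical, and it assumes `spark` is an already-created `SparkSession`.

```python
def rewrite_orc_with_cast(spark, src_path, dst_path, column, new_type):
    """Read an ORC dataset, cast one column, and write a new ORC dataset.

    `spark` is assumed to be an existing SparkSession. The function name and
    all arguments are illustrative, not part of any library.
    """
    # Spark cannot safely overwrite a path it is still reading from,
    # so require a separate destination.
    if src_path == dst_path:
        raise ValueError("dst_path must differ from src_path")

    # Step 1: read the ORC file(s).
    df = spark.read.orc(src_path)

    # Step 2: cast the column. Values that cannot be parsed as the new type
    # either become null or raise an error, depending on Spark's ANSI mode.
    converted = df.withColumn(column, df[column].cast(new_type))

    # Step 3: write the converted data back out as ORC.
    converted.write.mode("overwrite").orc(dst_path)

    return converted.schema
```

A call might look like `rewrite_orc_with_cast(spark, "hdfs:///data/events_orc", "hdfs:///data/events_orc_v2", "amount", "double")`, where the paths and the `amount` column are placeholders. Writing to a new path and swapping it in afterwards keeps the original file intact if the job fails midway.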

I'm looking for a lightweight solution where I can alter just the embedded metadata.

Thanks!


Solution

  • It's not the answer you're looking for, but no, you can't change a column's type in ORC without regenerating the file. What you're suggesting is the correct way to do it.

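    ORC stores indexes and aggregated column statistics (count, min, max, sum and so on) at the file, stripe and row-group level, so changing string -> double would require scanning the entire column to recompute them for what is now a numeric column. The values themselves are also encoded differently for each type, so the column data would have to be rewritten as well, not just the schema.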
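To illustrate why the stored statistics cannot simply be relabelled, here is a small plain-Python sketch (not ORC code; the sample values and helper names are made up) that computes min/max for the same values first as strings and then as doubles:

```python
# Hypothetical sample values, as they might appear in a string column.
values = ["10", "9", "2.5"]


def string_stats(vals):
    """Min/max under string (lexicographic) ordering."""
    return {"min": min(vals), "max": max(vals)}


def double_stats(vals):
    """Min/max/sum after interpreting the same values as doubles."""
    nums = [float(v) for v in vals]
    return {"min": min(nums), "max": max(nums), "sum": sum(nums)}


print(string_stats(values))  # {'min': '10', 'max': '9'}
print(double_stats(values))  # {'min': 2.5, 'max': 10.0, 'sum': 21.5}
```

The string minimum (`"10"`) and maximum (`"9"`) bear no relation to the numeric ones, so statistics written for a string column would give wrong answers if a reader used them to skip data in a double column.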