scala apache-spark parquet orc

Check if a file is an ORC file


I have a program whose input is expected to be in the ORC file format.

I want to be able to check whether the provided input is actually an ORC file. Checking the extension alone is not enough, because the user can omit it.

For Parquet, for example, we can check whether the file starts with the 4-byte magic "PAR1".

Is there an equivalent check for ORC?


Solution

  • As mentioned by @Ed Elliott, an ORC file stores this information in its tail: the 3 bytes before the last byte of an ORC file contain "ORC". Here is the code that worked for me:

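    import java.net.URI
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.charset.StandardCharsets
    import java.nio.file.{Paths, StandardOpenOption}
    val channel = FileChannel.open(Paths.get(new URI(path)), StandardOpenOption.READ)
    val buffer = ByteBuffer.allocate(3)
    // "ORC" sits just before the final byte; skip the read for files shorter than 4 bytes
    try { if (channel.size >= 4) channel.read(buffer, channel.size - 4) } finally channel.close()
    new String(buffer.array(), StandardCharsets.UTF_8) == "ORC"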
    

    It is worth mentioning that this read takes O(1) time with respect to the file size, because the number of bytes read is constant: the positional read goes straight to the given offset and does not iterate over the whole file.
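
For reuse, the same check could be wrapped in a small helper along the lines below. This is only a sketch: the names `OrcCheck` and `looksLikeOrc` are made up for illustration, and it takes a `java.nio.file.Path` to a local file rather than a URI string. It guards against files too short to hold the magic and loops on the read, since a positional read is allowed to return fewer bytes than requested.

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets
import java.nio.file.{Path, StandardOpenOption}

// Hypothetical helper, not part of any ORC or Spark API.
object OrcCheck {
  private val Magic: Array[Byte] = "ORC".getBytes(StandardCharsets.US_ASCII)

  /** Returns true if the 3 bytes before the last byte of `file` spell "ORC". */
  def looksLikeOrc(file: Path): Boolean = {
    val channel = FileChannel.open(file, StandardOpenOption.READ)
    try {
      val size = channel.size
      // Need at least the 3 magic bytes plus the trailing byte.
      if (size < Magic.length + 1) false
      else {
        val buffer = ByteBuffer.allocate(Magic.length)
        var position = size - Magic.length - 1
        var n = 0
        // Keep reading until the buffer is full or end of file (-1) is reached.
        while (buffer.hasRemaining && n >= 0) {
          n = channel.read(buffer, position)
          if (n > 0) position += n
        }
        !buffer.hasRemaining && java.util.Arrays.equals(buffer.array(), Magic)
      }
    } finally channel.close()
  }
}
```

Keep in mind that this only inspects the magic bytes. A non-ORC file could happen to end with the same bytes, so a positive result means "probably ORC", not "valid ORC".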