Tags: hadoop, hdfs, orc

How to get the schema (columns and their types) of ORC files stored in HDFS?


I have ORC files stored in different folders on HDFS as follows:

/DATA/UNIVERSITY/DEPT/STUDENT/part-00000.orc
/DATA/UNIVERSITY/DEPT/CREDIT/part-00000.orc

I do not know how many columns are present in each of those tables (STUDENT, CREDIT, etc.). Is there a way to get the schemas from these files? I want to obtain the column names and their data types so that I can write CREATE statements for Hive EXTERNAL tables.


Solution

  • I found a way to get the details I wanted (column names and their data types) via Spark:

    data = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`")
    data.printSchema()
    

    This prints the schema in the format below, which is exactly the information I want to extract from the ORC files on HDFS:

    root
     |-- <column_name1>: <datatype> (nullable = <true/false>)
     |-- <column_name2>: <datatype> (nullable = <true/false>)
     |-- <column_name3>: <datatype> (nullable = <true/false>)
     |-- <column_name4>: <datatype> (nullable = <true/false>)
     |-- <column_name5>: <datatype> (nullable = <true/false>)
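
    Building on this, the schema can be turned into the Hive DDL directly. Below is a minimal PySpark sketch, assuming a SparkSession named `spark` (standard in Spark 2.x+; with the older API, `sqlContext.read.orc(path)` is the equivalent reader) and a hypothetical table name `student`:

        # A minimal sketch, assuming a SparkSession named `spark`;
        # the table name `student` is hypothetical.
        path = "/DATA/UNIVERSITY/DEPT/STUDENT"
        df = spark.read.orc(path)

        # Spark's simpleString() type names (int, string, double, ...)
        # match Hive's for common primitives; complex types may need
        # manual adjustment.
        cols = ",\n  ".join(
            "`%s` %s" % (f.name, f.dataType.simpleString())
            for f in df.schema.fields
        )

        print("CREATE EXTERNAL TABLE student (\n  %s\n)\n"
              "STORED AS ORC\nLOCATION '%s';" % (cols, path))

    As an alternative that avoids Spark entirely, `hive --orcfiledump <HDFS_path>` dumps an ORC file's metadata, including its type structure, though that output is less directly reusable for writing DDL.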