Convert JSON to ORC using orc-tools

I am trying to convert a JSON file to ORC using the orc-tools jar described at

https://orc.apache.org/docs/tools.html#java-orc-tools

I have added this dependency to my pom.xml:

<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc-tools</artifactId>
    <version>1.3.1</version>
</dependency>

However, after the import, I am unable to see/import the class org.apache.orc.tools.json.JsonSchemaFinder which is used to infer the schema from JSON files.

An example that uses this class can be seen in this commit: https://github.com/apache/orc/pull/95/commits/2ee0be7e60e7ca77f574110ba1babfa2a8e93f3f

Am I using the wrong jar here?

Solution

These features are scheduled for release in ORC 1.4.0. The current 1.3.x releases don't include them.

In the meantime, you can check out the ORC git branch, copy the org.apache.orc.tools.convert and org.apache.orc.tools.json packages into your own repo, and use the features from there. Alternatively, you can build a jar from the ORC repo and depend on that instead. The converter's main method shows how the pieces fit together:

// parseOptions and computeSchema are helper methods defined elsewhere in the same class.
public static void main(Configuration conf,
                        String[] args) throws IOException, ParseException {
  CommandLine opts = parseOptions(args);
  TypeDescription schema;
  if (opts.hasOption('s')) {
    // Use the schema passed explicitly with -s.
    schema = TypeDescription.fromString(opts.getOptionValue('s'));
  } else {
    // Otherwise derive the schema from the input JSON files.
    schema = computeSchema(opts.getArgs());
  }
  // Write to the file given with -o, defaulting to output.orc.
  String outFilename = opts.hasOption('o')
      ? opts.getOptionValue('o') : "output.orc";
  Writer writer = OrcFile.createWriter(new Path(outFilename),
      OrcFile.writerOptions(conf).setSchema(schema));
  VectorizedRowBatch batch = schema.createRowBatch();
  for (String file: opts.getArgs()) {
    System.err.println("Processing " + file);
    RecordReader reader = new JsonReader(new Path(file), schema, conf);
    // Copy each batch of parsed rows into the ORC writer.
    while (reader.nextBatch(batch)) {
      writer.addRowBatch(batch);
    }
    reader.close();
  }
  writer.close();
}
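
If you would rather pass the schema explicitly with -s than rely on inference, the value is a type string in the form TypeDescription.fromString accepts, for example struct<id:bigint,name:string>. Below is a rough, standard-library-only sketch of how such a string could be built from a record that some JSON library has already parsed into Java maps, lists and boxed values. SchemaSketch and inferType are hypothetical names made up for this illustration. This is not how JsonSchemaFinder works internally: the sketch looks at a single record, lets the first list element decide the element type, and falls back to string for anything it does not recognize.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper, not part of ORC: maps parsed JSON values to ORC type strings. */
final class SchemaSketch {

    /** Returns an ORC type string for a value produced by some JSON parser. */
    static String inferType(Object value) {
        if (value instanceof Boolean) {
            return "boolean";
        }
        if (value instanceof Integer || value instanceof Long) {
            return "bigint";
        }
        if (value instanceof Number) {
            // Any other numeric type is treated as floating point.
            return "double";
        }
        if (value instanceof List) {
            List<?> list = (List<?>) value;
            // Simplification: the first element decides the element type.
            String element = list.isEmpty() ? "string" : inferType(list.get(0));
            return "array<" + element + ">";
        }
        if (value instanceof Map) {
            StringJoiner fields = new StringJoiner(",", "struct<", ">");
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                fields.add(e.getKey() + ":" + inferType(e.getValue()));
            }
            return fields.toString();
        }
        // Strings, nulls and anything unrecognized fall back to string.
        return "string";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the field order of the record.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 42L);
        record.put("name", "orc");
        record.put("scores", Arrays.asList(1.5, 2.0));
        System.out.println(inferType(record));
        // prints struct<id:bigint,name:string,scores:array<double>>
    }
}
```

The resulting string can then be handed to the converter through the -s option, which skips the computeSchema branch entirely.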