I am very new to big data and Tika. Is there a way to convert a Word document (.doc) to JSON format? I heard that a morphline has to be coded in Java to do this, but I don't know Java. Is there any solution available for this?
I will be using Tika in Apache Solr.
You can extract the document content as XML (XHTML) with Tika's ToXMLContentHandler, as in the following example, and then convert that XML to JSON.
More examples here
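// Parses the document and returns its body as an XHTML string.
public String parseBodyToHTML(InputStream stream) throws IOException, SAXException, TikaException {
    // BodyContentHandler limits the output to the content of <body>;
    // ToXMLContentHandler serializes the SAX events as XML.
    ContentHandler handler = new BodyContentHandler(
            new ToXMLContentHandler());
    AutoDetectParser parser = new AutoDetectParser(); // detects the file type (.doc, .pdf, ...)
    Metadata metadata = new Metadata();               // filled with document properties during parsing
    parser.parse(stream, handler, metadata);
    return handler.toString();
}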
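For the "then convert to JSON" step, here is a minimal sketch that uses only JDK classes. It assumes the string returned above is a run of XHTML elements with no XML declaration, so it wraps the string in a dummy root element before parsing. The class name XhtmlToJson, the "paragraphs" key and the choice to keep only the text of <p> elements are my own assumptions for illustration, not anything Tika prescribes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns the <p> elements of an XHTML fragment into a JSON array.
final class XhtmlToJson {

    static String toJson(String xhtmlFragment) throws Exception {
        // Assumption: the fragment may have several top-level elements,
        // so wrap it in a single root to make it well-formed XML.
        String wrapped = "<root>" + xhtmlFragment + "</root>";

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // the elements normally carry the XHTML namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));

        // "*" matches <p> elements regardless of their namespace.
        NodeList nodes = doc.getElementsByTagNameNS("*", "p");
        List<String> items = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim();
            if (!text.isEmpty()) {          // skip empty paragraphs
                items.add(quote(text));
            }
        }
        return "{\"paragraphs\":[" + String.join(",", items) + "]}";
    }

    // Escapes a string as a JSON string literal.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toJson("<p>Hello</p>\n<p>World</p>"));
    }
}
```

You would call it as XhtmlToJson.toJson(parseBodyToHTML(stream)). If you already have a JSON library such as Jackson on the classpath, it is probably safer to let that build the JSON instead of the hand-written quote method.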
Another option would be to write your own ContentHandler (a JsonHandler, say) that produces JSON directly.
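A rough sketch of what such a handler could look like, assuming you only want the paragraph text. JsonHandler is a made-up class, not something that ships with Tika; it extends the JDK's DefaultHandler, which implements ContentHandler, and collects the text of every <p> element. The main method feeds it a hand-written XHTML string through the JDK SAX parser just so the example runs without Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each <p> element and renders it as JSON.
class JsonHandler extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    // Falls back to the qualified name when the parser is not namespace-aware.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(name(localName, qName))) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "p".equals(name(localName, qName))) {
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
            current = null;
        }
    }

    // Returns the collected paragraphs as a JSON object.
    @Override
    public String toString() {
        List<String> quoted = new ArrayList<>();
        for (String p : paragraphs) {
            quoted.add(quote(p));
        }
        return "{\"paragraphs\":[" + String.join(",", quoted) + "]}";
    }

    // Escapes a string as a JSON string literal.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>First</p><p>Second</p></body></html>";
        JsonHandler handler = new JsonHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        System.out.println(handler);
    }
}
```

With Tika you would pass it in place of the BodyContentHandler, i.e. parser.parse(stream, new JsonHandler(), metadata), and read the JSON from handler.toString() afterwards. This skips the intermediate XML string entirely, at the cost of having to decide yourself which elements end up in the JSON.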