solr, solrj, apache-tika

How to convert a Word document into JSON with Apache Tika


I am very new to big data and Tika. Is there a way to convert a Word document (.doc) to JSON format? I heard that a morphline has to be coded in Java to do this, but I don't know Java. Is there any other solution available?

I will be using Tika with Apache Solr.


Solution

  • You can extract the document as XML (XHTML) with ToXMLContentHandler, as in the following example, and then convert that XML to JSON.

    More examples here

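    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public String parseBodyToHTML(InputStream stream) throws IOException, SAXException, TikaException {
        // BodyContentHandler forwards only the content inside <body>; ToXMLContentHandler serializes it as XML.
        ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
        AutoDetectParser parser = new AutoDetectParser();  // detects the document type (e.g. .doc) automatically
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata);
        return handler.toString();  // the serialized XML collected by the wrapped handler
    }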
    

    Another option would be to write your own ContentHandler (a JsonHandler, for example) that emits JSON directly.
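As a rough illustration of the second option, here is a minimal sketch of a SAX handler that collects paragraph text and renders it as JSON. The class name `JsonParagraphHandler`, its `toJson()` and `xhtmlToJson()` methods, and the `"paragraphs"` key are all made up for this example and are not part of Tika. It assumes the parser reports paragraphs as `<p>` elements in its XHTML output, and it uses only JDK classes, so it can be tried on an XHTML string without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of each <p> element and renders a JSON object. */
class JsonParagraphHandler extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(elementName(localName, qName))) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "p".equals(elementName(localName, qName))) {
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text); // whitespace-only paragraphs are skipped
            }
            current = null;
        }
    }

    // localName is empty when the parser is not namespace-aware, so fall back to qName.
    private static String elementName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    /** Renders {"paragraphs":[...]} using a small hand-written string escaper. */
    public String toJson() {
        StringBuilder sb = new StringBuilder("{\"paragraphs\":[");
        for (int i = 0; i < paragraphs.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(escape(paragraphs.get(i))).append('"');
        }
        return sb.append("]}").toString();
    }

    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Convenience for trying the handler on an XHTML string with the JDK's SAX parser. */
    public static String xhtmlToJson(String xhtml) {
        try {
            JsonParagraphHandler handler = new JsonParagraphHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toJson();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

With Tika, the same handler could presumably be passed straight to `parser.parse(stream, handler, metadata)` in place of the `BodyContentHandler`, followed by a call to `handler.toJson()`. In a real project a JSON library would be a safer choice than the hand-written escaping shown here.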