hadoop elasticsearch full-text-search hdfs elasticsearch-plugin

Best practices for a searchable archive on Hadoop with a variety of documents (PDF, PPT, MS Word, plain text, etc.)

I have a variety of documents in various formats (PDF, MS Word, PPT, plain text, etc.) stored in HDFS. I need to extract their contents into an Elasticsearch index and build a full-text search system on top of it. I've read about ES-Hadoop, but I am a little confused about whether I should use the mapper-attachments plugin of ES or Apache Tika in this case, and whether ES-Hadoop is real time or not (in case I use it).

I'd like to know the right way to extract the contents of these documents into ES indexes and search them.

Any help would be appreciated.

Sachin

Solution

Regarding whether to use the ES mapper-attachments plugin or Apache Tika: I would recommend the mapper-attachments plugin. It is well integrated with Elasticsearch (it uses Tika internally for text extraction) and will save you a lot of overhead in indexing the documents and adding meta information to them.
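As a rough sketch of what indexing through the plugin involves: a field is mapped with the `attachment` type, and each document carries the file's bytes base64-encoded in that field. The snippet below only builds the two request bodies. The type name `document` and the field name `file` are hypothetical, `raw_bytes` stands in for a file read from HDFS, and the HTTP calls to Elasticsearch are left out because they depend on the client and ES version in use.

```python
import base64
import json

# Hypothetical names, chosen for illustration only.
DOC_TYPE = "document"
FIELD_NAME = "file"


def build_attachment_mapping(doc_type=DOC_TYPE, field=FIELD_NAME):
    """Return a mapping body that declares `field` as an attachment field."""
    return {doc_type: {"properties": {field: {"type": "attachment"}}}}


def build_attachment_document(raw_bytes, field=FIELD_NAME):
    """Return a document body holding the file content as base64 text."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {field: encoded}


if __name__ == "__main__":
    # In practice raw_bytes would be the content of a PDF, DOC, PPT, etc.
    raw_bytes = b"hello"
    print(json.dumps(build_attachment_mapping(), indent=2))
    print(json.dumps(build_attachment_document(raw_bytes)))
```

The plugin then extracts the text and metadata on the Elasticsearch side, so the client never has to parse the file format itself.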

As far as I know, ES-Hadoop does not expose streaming (real-time) APIs. I am working with ES-Hadoop and Apache Spark, and I had to implement a form of streaming into Elasticsearch myself using Apache Kafka.
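A minimal sketch of the micro-batching idea behind such a setup, not the actual pipeline described above: messages are grouped into small batches and each batch is turned into a body for the Elasticsearch bulk API (one action line followed by one source line per document, newline-terminated). The `messages` iterable stands in for a Kafka consumer, and the index name `docs` is hypothetical. Older ES versions may also expect a `_type` in the action line or in the request URL.

```python
import json
from itertools import islice


def batches(messages, size):
    """Yield lists of at most `size` items from any iterable of messages."""
    it = iter(messages)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_bulk_body(chunk, index):
    """Build a newline-delimited bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in chunk:
        # Action/metadata line, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for messages pulled from a Kafka topic.
    messages = [("1", {"text": "a"}), ("2", {"text": "b"}), ("3", {"text": "c"})]
    for chunk in batches(messages, size=2):
        print(to_bulk_body(chunk, index="docs"), end="")
```

Each body would then be sent to the `_bulk` endpoint. Smaller batches lower the indexing latency, at the cost of more requests.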

Hope that helps.