I have a variety of documents in various formats (PDF, MS Word, PPT, plain text, etc.) stored in HDFS. I need to extract their contents into an Elasticsearch index and build a full-text search system on top of it. I've read about ES-Hadoop, but I'm a little confused about whether I can use the ES mapper-attachments plugin or Apache Tika in this case, and whether ES-Hadoop is real time or not (in case I use it).
I'm curious what the right way would be to extract the contents of the documents into ES indexes and search them.
Any help would be appreciated.
Sachin
Regarding your question about whether to use the ES mapper-attachments plugin or Apache Tika: I would recommend the mapper plugin, as it is well integrated with Elasticsearch and will save you a lot of overhead in indexing and in adding meta information to the documents you are indexing.
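To make the plugin route a bit more concrete, here is a minimal sketch of what gets sent to Elasticsearch: a mapping that declares a field of type "attachment", and a document whose field value is the base64-encoded file content (the plugin uses Tika internally to extract the text). The index, type, and field names (`documents`, `doc`, `file`) are placeholders I made up for illustration, and the sample bytes stand in for a file read from HDFS. The sketch only builds the JSON bodies; sending them is left to whatever client you use.

```python
import base64
import json

# Hypothetical names chosen for this sketch; they are not from the original post.
INDEX_NAME = "documents"
TYPE_NAME = "doc"
ATTACHMENT_FIELD = "file"


def build_mapping():
    """Return a mapping body that declares one field of type "attachment"."""
    return {
        "mappings": {
            TYPE_NAME: {
                "properties": {
                    ATTACHMENT_FIELD: {"type": "attachment"}
                }
            }
        }
    }


def build_document(raw_bytes):
    """Wrap raw file bytes as the base64 string the attachment field expects."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {ATTACHMENT_FIELD: encoded}


if __name__ == "__main__":
    # Plain bytes stand in for a PDF or Word file read from HDFS.
    sample = b"Plain text stands in for a real document."
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_document(sample)))
```

The mapping body would go into the index-creation request for `INDEX_NAME`, and each document body into a normal index request. Keep in mind that base64 inflates the payload by roughly a third, so very large files may need a size limit on your side.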
As far as I know, ES-Hadoop does not expose streaming (real-time) APIs. I am working with ES-Hadoop and Apache Spark, and I had to implement a sort of streaming of data to Elasticsearch myself using Apache Kafka.
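I can't share my actual pipeline, but the general idea of near-real-time indexing can be sketched as micro-batching: group incoming messages into small batches and render each batch as a Bulk API request body. In the sketch below, a plain Python iterable stands in for the Kafka consumer, and the record shape (`id`, `source`) and the helper names `batched` and `to_bulk_body` are assumptions made for illustration only.

```python
import json
from itertools import islice


def batched(records, size):
    """Yield lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_bulk_body(index_name, chunk):
    """Render one batch as a newline-delimited Bulk API request body."""
    lines = []
    for record in chunk:
        # Action line first, then the document source on the following line.
        action = {"index": {"_index": index_name, "_id": record["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record["source"]))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A list stands in for messages pulled from a Kafka topic.
    messages = [{"id": str(i), "source": {"text": "doc %d" % i}} for i in range(5)]
    for chunk in batched(messages, 2):
        print(to_bulk_body("documents", chunk), end="")
```

Each body would be POSTed to the `_bulk` endpoint. Older Elasticsearch versions also expect a type, either in the URL or as `_type` in the action line. Smaller batches lower latency at the cost of more requests, so the batch size is the main knob to tune.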
Hope that helps.