Tags: elasticsearch, logstash, fscrawler

Merging data from different sources at index time


I have two file crawler jobs running separately, using fscrawler (https://github.com/dadoonet/fscrawler), on data that are related to each other. I now want to merge the data somehow at index time (a child-parent relation or a flat document is fine), so some middleware is needed. Looking at both Logstash and the new Ingest Node feature in ES 5.0, neither seems to support writing custom processors.

Is there any way to do this sort of merging/relational mapping at index time, or do I have to do post-processing instead?

EDIT: One job crawls "articles" in JSON format. An article can have multiple attachments, declared in an attachments array in the JSON, which live in a different location. The second job crawls the actual attachments (e.g. PDFs), applying Tika processing to them. In the end I would like to have a single article type that also contains the content of the attachments.
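For illustration, a crawled article might look roughly like this (field names and paths are hypothetical, not taken from an actual fscrawler job):

    {
      "title": "Quarterly report",
      "body": "Article text ...",
      "attachments": [
        "/data/attachments/report-q3.pdf",
        "/data/attachments/appendix.pdf"
      ]
    }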


Solution

  • If you loaded both document sets into different ES indices, you could have a Logstash elasticsearch input that looks for articles that don't (yet) contain the content of their attachments. For each of those documents, you could then query the other Elasticsearch index (see the elasticsearch{} filter in Logstash) and update the article document; a pipeline sketch follows below.
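Here is a minimal pipeline sketch of that idea. It assumes the two jobs write to indices named articles and attachments, that attachment documents carry fscrawler's content and path.real fields, and that the merged text goes into an attachments_content field on the article; all of those names are assumptions, not something stated in the question:

    input {
      elasticsearch {
        hosts   => ["localhost:9200"]
        index   => "articles"
        # Only pick up articles that don't yet carry the attachment text.
        query   => '{ "query": { "bool": { "must_not": { "exists": { "field": "attachments_content" } } } } }'
        docinfo => true   # exposes _index/_type/_id under [@metadata]
      }
    }

    filter {
      # Look up the crawled attachment by its path and copy the
      # Tika-extracted text into the article document.
      elasticsearch {
        hosts  => ["localhost:9200"]
        index  => "attachments"
        query  => "path.real:\"%{[attachments][0]}\""
        fields => { "content" => "attachments_content" }
      }
    }

    output {
      # Write the enriched article back over the original document.
      elasticsearch {
        hosts         => ["localhost:9200"]
        index         => "articles"
        document_id   => "%{[@metadata][_id]}"
        action        => "update"
        doc_as_upsert => true
      }
    }

Note that this only copies the content of the first entry in the attachments array; merging several attachments would need extra logic (e.g. a ruby filter), and the query-string lookup may need escaping for paths containing special characters.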