elasticsearch, apache-tika, fscrawler

Push custom fields to metadata of PDF using fscrawler


I am using fscrawler to index PDF documents using the following command:

/usr/bin/fscrawler --config_dir /home/user1/conf test_index --restart --loop 1

The PDF metadata is indexed. I want to add custom fields to the PDF metadata and index them as well. I have adapted the configuration file as follows:

metadata:
  custom_field1:
    type: text
  custom_field2:
    type: keyword

How can I index these custom fields together with the PDF using fscrawler?


Solution

  • You can define an ingest pipeline in Elasticsearch containing one or more set processors and tell FSCrawler to use this pipeline. There's an example of this in the documentation.

    Would that work for you? If not, I think we should support indexing specific metadata per file, for example by checking whether a file named foo.pdf.metadata exists in a side folder that holds all the metadata files. I opened a feature request for it.

    Otherwise, the REST Service of FSCrawler could allow adding metadata to the files.
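
The ingest pipeline suggestion above could be sketched roughly as follows. This is a minimal illustration, not FSCrawler's own tooling: it builds a pipeline body with one set processor per custom field and can send it to Elasticsearch with PUT _ingest/pipeline/<id>. The pipeline id fscrawler_custom_fields, the cluster URL http://localhost:9200, the field values, and the helper names build_pipeline and put_pipeline are all assumptions made for the example.

```python
import json
import urllib.request


def build_pipeline(custom_fields):
    """Return an ingest pipeline body with one `set` processor per field."""
    return {
        "description": "Add custom fields to documents indexed by FSCrawler",
        "processors": [
            {"set": {"field": name, "value": value}}
            for name, value in custom_fields.items()
        ],
    }


def put_pipeline(es_url, pipeline_id, body):
    """Create or replace the pipeline via PUT _ingest/pipeline/<id>."""
    request = urllib.request.Request(
        url=f"{es_url}/_ingest/pipeline/{pipeline_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Field names come from the question; the values are placeholders.
pipeline = build_pipeline(
    {"custom_field1": "some text", "custom_field2": "some-keyword"}
)
print(json.dumps(pipeline, indent=2))

# Uncomment where the cluster is reachable (URL and id are assumptions):
# put_pipeline("http://localhost:9200", "fscrawler_custom_fields", pipeline)
```

Once the pipeline exists, the FSCrawler job settings would need to reference it by id; the FSCrawler documentation describes a pipeline entry under the elasticsearch section for this, so check the exact key for your version. Note that a set processor writes the same value into every document, so this approach fits fields that are constant for a whole job rather than values that differ per file.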