There is a PDF document with attachments (here: joboptions files) that should not be extracted by Tika, and their contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?
Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. Tika calls the DocumentSelector with the metadata of each embedded document to decide whether that embedded document should be parsed.
Example DocumentSelector:
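import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {
    @Override
    public boolean select(Metadata metadata) {
        // Name of the embedded file; may be null if the attachment has none.
        // (In Tika 2.x this constant lives in TikaCoreProperties.)
        String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
        // Parse everything except attachments ending in ".joboptions".
        return resourceName == null || !resourceName.endsWith(".joboptions");
    }
}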
Register it on the ParseContext:
parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
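If more than one attachment type has to be skipped, or the file names vary in case, the decision can be moved into a small helper that select(...) delegates to. The following is only a sketch of that idea: AttachmentFilter, EXCLUDED_EXTENSIONS and shouldParse are hypothetical names, not part of Tika, and the extension list is an assumption to adapt.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Tika): decides from the resource name
// alone whether an embedded document should be parsed.
final class AttachmentFilter {

    // Assumed list of extensions to skip, in lower case; extend as needed.
    private static final Set<String> EXCLUDED_EXTENSIONS = Set.of(".joboptions");

    private AttachmentFilter() {
    }

    // Returns false for names ending in an excluded extension (case-insensitive).
    // Attachments without a name are parsed, as in the selector above.
    static boolean shouldParse(String resourceName) {
        if (resourceName == null) {
            return true;
        }
        String lower = resourceName.toLowerCase(Locale.ROOT);
        for (String extension : EXCLUDED_EXTENSIONS) {
            if (lower.endsWith(extension)) {
                return false;
            }
        }
        return true;
    }
}
```

Inside CustomDocumentSelector.select(...), the return statement would then become return AttachmentFilter.shouldParse(resourceName);. To skip all attachments instead of selected ones, select(...) can simply return false for every embedded document.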