In Hibernate Search 6 the Apache Tika bridge has disappeared:
https://docs.jboss.org/hibernate/search/6.0/migration/html_single/#tikabridge
What is the best way to index the contents of a PDF or a Word document file now? Is there any alternative?
You could write your own bridge, as described in the Hibernate Search 6 reference documentation.
Something like this:
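    import java.io.IOException;
    import java.io.InputStream;
    import java.io.StringWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.WriteOutContentHandler;
    import org.hibernate.search.mapper.pojo.bridge.ValueBridge;
    import org.hibernate.search.mapper.pojo.bridge.runtime.ValueBridgeToIndexedValueContext;
    import org.xml.sax.SAXException;

    public class TikaBridge implements ValueBridge<String, String> {
        private final Parser parser;

        public TikaBridge() {
            parser = new AutoDetectParser();
        }

        @Override
        public String toIndexedValue(String documentPath, ValueBridgeToIndexedValueContext context) {
            if (documentPath == null) {
                return null;
            }
            // Let Tika detect the file format and write the extracted text to the writer.
            try (InputStream input = Files.newInputStream(Paths.get(documentPath))) {
                StringWriter writer = new StringWriter();
                WriteOutContentHandler contentHandler = new WriteOutContentHandler(writer);
                Metadata metadata = new Metadata();
                ParseContext parseContext = new ParseContext();
                parser.parse(input, contentHandler, metadata, parseContext);
                return writer.toString();
            } catch (IOException | SAXException | TikaException e) {
                throw new IllegalStateException("Could not extract text from " + documentPath, e);
            }
        }
    }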
Then implement an annotation and its processor:
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ ElementType.METHOD, ElementType.FIELD })
    @PropertyMapping(processor = @PropertyMappingAnnotationProcessorRef(
            type = TikaField.Processor.class
    ))
    @Documented
    @Repeatable(TikaField.List.class)
    public @interface TikaField {

        String name() default "";

        ContainerExtraction extraction() default @ContainerExtraction();

        @Documented
        @Target({ ElementType.METHOD, ElementType.FIELD })
        @Retention(RetentionPolicy.RUNTIME)
        @interface List {
            TikaField[] value();
        }

        class Processor implements PropertyMappingAnnotationProcessor<TikaField> {
            @Override
            public void process(PropertyMappingStep mapping, TikaField annotation,
                    PropertyMappingAnnotationProcessorContext context) {
                TikaBridge bridge = new TikaBridge();
                mapping.genericField(annotation.name().isEmpty() ? null : annotation.name())
                        .valueBridge(bridge)
                        .extractors(context.toContainerExtractorPath(annotation.extraction()));
            }
        }
    }
Then just use it on your model:
    public class MyEntity {
        // ...

        // Holds the filesystem path of the document; the bridge reads and parses that file.
        @TikaField
        String myDocument;
    }
Should you need any parameters, you can add them to your annotation and pass them along to your bridge's constructor.
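As an illustration of such a parameter, suppose the entity stores paths relative to some storage root. The sketch below shows only the parameterized part, in plain Java with no Hibernate Search or Tika dependency. The `DocumentLocator` class and the idea of a base directory are hypothetical names introduced for this example; they are not part of either library.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: holds a value that the annotation would supply
// and that the bridge would receive through its constructor.
public class DocumentLocator {

    private final Path baseDirectory;

    public DocumentLocator(String baseDirectory) {
        // A null or empty value means "no base directory": paths are used as given.
        this.baseDirectory = baseDirectory == null || baseDirectory.isEmpty()
                ? null
                : Paths.get(baseDirectory);
    }

    public Path resolve(String documentPath) {
        Path path = Paths.get(documentPath);
        // Absolute paths, or no configured base directory: leave the path untouched.
        if (baseDirectory == null || path.isAbsolute()) {
            return path;
        }
        return baseDirectory.resolve(path).normalize();
    }
}
```

Under these assumptions, the annotation would declare an attribute such as `String baseDirectory() default "";`, the processor would call `new TikaBridge(annotation.baseDirectory())`, and the bridge would open `locator.resolve(documentPath)` instead of `Paths.get(documentPath)`.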
If you need to populate multiple fields from a single PDF/Word document, for example to index metadata as well as the document content, you will have to implement a PropertyBridge instead, because a ValueBridge can only populate a single field. That's a bit more complicated, but the approach is similar.