Tika Bridge is deprecated in Hibernate Search 6. Alternatives?

In Hibernate Search 6, the Apache Tika bridge has been removed:

https://docs.jboss.org/hibernate/search/6.0/migration/html_single/#tikabridge

What is the best way to index the contents of a PDF or a Word document file now? Is there any alternative?

Solution

You could write your own bridge, as described in the chapter on bridges in the Hibernate Search 6 reference documentation.

Something like this:

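import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.hibernate.search.mapper.pojo.bridge.ValueBridge;
import org.hibernate.search.mapper.pojo.bridge.runtime.ValueBridgeToIndexedValueContext;
import org.xml.sax.SAXException;

// Assumes the bridged property holds the path of the document on the file system.
public class TikaBridge implements ValueBridge<String, String> {
    private final Parser parser;

    public TikaBridge() {
        parser = new AutoDetectParser();
    }

    @Override
    public String toIndexedValue(String documentPath, ValueBridgeToIndexedValueContext context) {
        if (documentPath == null) {
            return null;
        }
        try (InputStream input = Files.newInputStream(Paths.get(documentPath))) {
            StringWriter writer = new StringWriter();
            WriteOutContentHandler contentHandler = new WriteOutContentHandler(writer);
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            parser.parse(input, contentHandler, metadata, parseContext);
            return writer.toString();
        }
        catch (IOException | SAXException | TikaException e) {
            // parse() throws checked exceptions that toIndexedValue() cannot declare.
            throw new IllegalStateException("Could not extract text from " + documentPath, e);
        }
    }
}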

Then implement an annotation and its processor:

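import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.hibernate.search.mapper.pojo.extractor.mapping.annotation.ContainerExtraction;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMapping;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessor;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessorContext;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessorRef;
import org.hibernate.search.mapper.pojo.mapping.definition.programmatic.PropertyMappingStep;

@Retention(RetentionPolicy.RUNTIME)
@Target({ ElementType.METHOD, ElementType.FIELD })
@PropertyMapping(processor = @PropertyMappingAnnotationProcessorRef(
        type = TikaField.Processor.class
))
@Documented
@Repeatable(TikaField.List.class)
public @interface TikaField {

    // Name of the index field; defaults to the property name when empty.
    String name() default "";

    ContainerExtraction extraction() default @ContainerExtraction();

    @Documented
    @Target({ ElementType.METHOD, ElementType.FIELD })
    @Retention(RetentionPolicy.RUNTIME)
    @interface List {
        TikaField[] value();
    }

    class Processor implements PropertyMappingAnnotationProcessor<TikaField> {
        @Override
        public void process(PropertyMappingStep mapping, TikaField annotation,
                PropertyMappingAnnotationProcessorContext context) {
            TikaBridge bridge = new TikaBridge();
            // A full-text field, so the extracted text is analyzed (tokenized)
            // instead of being indexed as one single keyword.
            mapping.fullTextField(annotation.name().isEmpty() ? null : annotation.name())
                    .valueBridge(bridge)
                    .extractors(context.toContainerExtractorPath(annotation.extraction()));
        }
    }
}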

Then just use it on your model:

public class MyEntity {

    // ...

    // Path to the PDF/Word file; the bridge reads and parses it at indexing time.
    @TikaField
    String myDocument;

}

Should you need any parameters, you can add them to your annotation and pass them along to your bridge's constructor.
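To make the parameter-passing idea concrete, here is a rough sketch that uses only the JDK so it can run on its own. Everything in it is a stand-in: `maxLength` is a hypothetical parameter, the nested `@TikaField` is a simplified copy of the annotation above, `TruncatingBridge` plays the role of the bridge, and `bridgeFor` reads the annotation reflectively where the real `Processor.process(...)` would simply call `annotation.maxLength()`.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class TikaFieldParameterSketch {

    // Simplified stand-in for @TikaField with one hypothetical parameter:
    // the maximum number of characters of extracted text to index.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface TikaField {
        int maxLength() default Integer.MAX_VALUE;
    }

    // Stand-in for the bridge: the parameter arrives through the constructor.
    static class TruncatingBridge {
        private final int maxLength;

        TruncatingBridge(int maxLength) {
            if (maxLength < 0) {
                throw new IllegalArgumentException("maxLength must not be negative");
            }
            this.maxLength = maxLength;
        }

        // In the real bridge, extractedText would be the text produced by Tika.
        String toIndexedValue(String extractedText) {
            if (extractedText == null) {
                return null;
            }
            return extractedText.length() <= maxLength
                    ? extractedText
                    : extractedText.substring(0, maxLength);
        }
    }

    static class MyEntity {
        @TikaField(maxLength = 5)
        String myDocument;
    }

    // Stand-in for Processor.process(...): read the annotation attribute
    // and hand it to the bridge's constructor.
    static TruncatingBridge bridgeFor(Class<?> type, String fieldName) {
        try {
            Field field = type.getDeclaredField(fieldName);
            TikaField annotation = field.getAnnotation(TikaField.class);
            if (annotation == null) {
                throw new IllegalArgumentException(fieldName + " is not annotated with @TikaField");
            }
            return new TruncatingBridge(annotation.maxLength());
        }
        catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No field named " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        TruncatingBridge bridge = bridgeFor(MyEntity.class, "myDocument");
        System.out.println(bridge.toIndexedValue("Hello, world")); // prints "Hello"
    }
}
```

In the actual Hibernate Search processor, the equivalent would presumably be a one-line change: construct the bridge with the annotation's attribute (for example `new TikaBridge(annotation.maxLength())`, assuming you add such a constructor) before passing it to `valueBridge(...)`.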

If you need to populate multiple fields from a single PDF/Word document, for example to index metadata as well as the document content, then you will have to implement a PropertyBridge instead, since a ValueBridge can only populate one field. That's a bit more involved, but follows the same pattern.