
Tika bridge was removed in Hibernate Search 6. Alternatives?


In Hibernate Search 6 the Apache Tika bridge has disappeared:

https://docs.jboss.org/hibernate/search/6.0/migration/html_single/#tikabridge

What is the best way to index the contents of a PDF or a Word document file now? Is there any alternative?


Solution

  • You could write your own bridge, as described in the section on bridges in the Hibernate Search reference documentation.

    Something like this:

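    import java.io.IOException;
    import java.io.InputStream;
    import java.io.StringWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.WriteOutContentHandler;
    import org.hibernate.search.mapper.pojo.bridge.ValueBridge;
    import org.hibernate.search.mapper.pojo.bridge.runtime.ValueBridgeToIndexedValueContext;
    import org.xml.sax.SAXException;

    public class TikaBridge implements ValueBridge<String, String> {
        private final Parser parser;

        public TikaBridge() {
            parser = new AutoDetectParser();
        }

        @Override
        public String toIndexedValue(String documentPath, ValueBridgeToIndexedValueContext context) {
            if (documentPath == null) {
                return null;
            }
            // The property holds a file path: open the file and let Tika extract its text.
            try (InputStream input = Files.newInputStream(Paths.get(documentPath))) {
                StringWriter writer = new StringWriter();
                WriteOutContentHandler contentHandler = new WriteOutContentHandler(writer);
                Metadata metadata = new Metadata();
                ParseContext parseContext = new ParseContext();
                parser.parse(input, contentHandler, metadata, parseContext);
                return writer.toString();
            }
            catch (IOException | SAXException | TikaException e) {
                throw new IllegalStateException("Could not extract text from " + documentPath, e);
            }
        }
    }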
    

    Then implement an annotation and its processor:

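    import java.lang.annotation.Documented;
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Repeatable;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    import org.hibernate.search.mapper.pojo.extractor.mapping.annotation.ContainerExtraction;
    import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMapping;
    import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessor;
    import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessorContext;
    import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessorRef;
    import org.hibernate.search.mapper.pojo.mapping.definition.programmatic.PropertyMappingStep;

    @Retention(RetentionPolicy.RUNTIME)
    @Target({ ElementType.METHOD, ElementType.FIELD })
    @PropertyMapping(processor = @PropertyMappingAnnotationProcessorRef(
            type = TikaField.Processor.class
    ))
    @Documented
    @Repeatable(TikaField.List.class)
    public @interface TikaField {

        // Name of the index field; defaults to the name of the annotated property.
        String name() default "";

        ContainerExtraction extraction() default @ContainerExtraction();

        @Documented
        @Target({ ElementType.METHOD, ElementType.FIELD })
        @Retention(RetentionPolicy.RUNTIME)
        @interface List {
            TikaField[] value();
        }

        // Translates the annotation into a field mapping that uses the Tika bridge.
        class Processor implements PropertyMappingAnnotationProcessor<TikaField> {
            @Override
            public void process(PropertyMappingStep mapping, TikaField annotation,
                    PropertyMappingAnnotationProcessorContext context) {
                TikaBridge bridge = new TikaBridge();
                mapping.genericField(annotation.name().isEmpty() ? null : annotation.name())
                        .valueBridge(bridge)
                        .extractors(context.toContainerExtractorPath(annotation.extraction()));
            }
        }
    }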
    
    

    Then just use it on your model:

    public class MyEntity {

        // ...

        // Path to the PDF/Word file; TikaBridge opens it and indexes the extracted text.
        @TikaField
        String myDocument;

    }
    

    Should you need any parameters, you can add them to your annotation and pass them along to your bridge's constructor.
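
    To illustrate the wiring only, here is a minimal plain-Java sketch that does not use the Hibernate Search or Tika APIs. `LimitedText`, its `maxLength` attribute, `LimitedBridge` and `bridgeFor` are all hypothetical names: the annotation stands in for an extra attribute on `@TikaField`, the bridge for `TikaBridge`, and the reflection in `bridgeFor` for what the annotation processor does when it reads the attribute and passes it to the constructor.

    ```java
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    public class AnnotationParameterSketch {

        // Hypothetical annotation with one parameter, like an extra attribute on @TikaField.
        @Retention(RetentionPolicy.RUNTIME)
        @Target(ElementType.FIELD)
        public @interface LimitedText {
            int maxLength() default Integer.MAX_VALUE;
        }

        // Hypothetical stand-in for the bridge: the parameter arrives through the constructor.
        public static final class LimitedBridge {
            private final int maxLength;

            public LimitedBridge(int maxLength) {
                this.maxLength = maxLength;
            }

            public String toIndexedValue(String text) {
                if (text == null) {
                    return null;
                }
                // Keep at most maxLength characters of the extracted text.
                return text.length() <= maxLength ? text : text.substring(0, maxLength);
            }
        }

        public static class Sample {
            @LimitedText(maxLength = 5)
            String body;
        }

        // Plays the role of the annotation processor: read the attribute, build the bridge.
        public static LimitedBridge bridgeFor(Class<?> type, String fieldName) {
            try {
                LimitedText annotation = type.getDeclaredField(fieldName).getAnnotation(LimitedText.class);
                if (annotation == null) {
                    throw new IllegalArgumentException(fieldName + " is not annotated with @LimitedText");
                }
                return new LimitedBridge(annotation.maxLength());
            }
            catch (NoSuchFieldException e) {
                throw new IllegalArgumentException("No field named " + fieldName, e);
            }
        }

        public static void main(String[] args) {
            LimitedBridge bridge = bridgeFor(Sample.class, "body");
            System.out.println(bridge.toIndexedValue("Hello, world")); // prints "Hello"
        }
    }
    ```

    In the real mapping, the equivalent step would be reading the attribute from the annotation inside the processor's process method and handing it to the bridge's constructor.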

    If you need to populate multiple fields from a single PDF/Word document, for example to index metadata as well as the document content, then you will have to implement a PropertyBridge instead: it allows populating multiple fields instead of just one. That's a bit more complicated, but similar.
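
    The shape of that "one document, several fields" step can be sketched in plain Java. This is not the PropertyBridge API: `ParsedDocument` and `toIndexFields` are hypothetical stand-ins for a parse result (text plus metadata) and for the bridge's write step, a `Map` stands in for the index document, and the field names `content`, `title` and `author` are assumptions.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class MultiFieldSketch {

        // Hypothetical stand-in for a parse result: body text plus metadata entries.
        public static final class ParsedDocument {
            final String content;
            final Map<String, String> metadata;

            public ParsedDocument(String content, Map<String, String> metadata) {
                this.content = content;
                this.metadata = metadata;
            }
        }

        // Plays the role of the bridge's write step: one input, several index fields.
        public static Map<String, String> toIndexFields(ParsedDocument document) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("content", document.content);
            // Copy only the metadata entries chosen for indexing; skip missing ones.
            for (String key : new String[] { "title", "author" }) {
                String value = document.metadata.get(key);
                if (value != null) {
                    fields.put(key, value);
                }
            }
            return fields;
        }

        public static void main(String[] args) {
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("title", "Annual report");
            metadata.put("producer", "Some PDF tool");
            // Prints {content=Revenue grew., title=Annual report}
            System.out.println(toIndexFields(new ParsedDocument("Revenue grew.", metadata)));
        }
    }
    ```

    With an actual PropertyBridge, the fields would be declared up front when the bridge is bound, and the values written to the document element that Hibernate Search passes in, rather than to a map.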