Tika, Maven, dependencies... Why is Tika using EmptyParser?


I want to use Tika as a dependency in a Maven project to extract metadata from files. It works fine when I run the class with mvn exec:java, but not with java -cp, so I suspect a dependency problem.

I bundled all the dependencies into the jar with the Maven Shade plugin, and the build output shows them being included.

The pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.company.myapp</groupId>
  <artifactId>metadata-extractor</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>

  <name>Metadata Extractor</name>
  <url>http://maven.apache.org</url>

  <properties>
    <tika.version>1.19</tika.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <!-- Tika -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>${tika.version}</version>
    </dependency>
  </dependencies>


    <build>
      <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.8.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.2.0</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <minimizeJar>true</minimizeJar>
                <filters>
                  <filter>
                    <artifact>*:*</artifact>
                    <excludes>
                      <exclude>META-INF/*.SF</exclude>
                      <exclude>META-INF/*.DSA</exclude>
                      <exclude>META-INF/*.RSA</exclude>
                    </excludes>
                  </filter>
                </filters>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>

</project>

Main class:

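package org.company.myapp;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class App
{
    public static void main( String[] args )
    {
        // Get path
        Path path = Paths.get("/path/to/image.jpg");

        // Use Tika
        TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser(tikaConfig);
        // -1 disables the write limit on the extracted text
        ContentHandler handler = new BodyContentHandler(-1);

        // try-with-resources closes the stream once parsing is done
        try (TikaInputStream stream = TikaInputStream.get(path, metadata)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        } catch (IOException | SAXException | TikaException e) {
            System.out.println("error: " + e.toString());
            return;
        }

        // Prints the metadata and content...
        System.out.println("Parsed Metadata: ");
        System.out.println(metadata);
        System.out.println("Parsed Text: ");
        System.out.println(handler.toString());

    }
}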

Result, with mvn exec:java (working as expected):

Parsed Metadata: 
... X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.jpeg.JpegParser ... other metadata ... 
Parsed Text: 

But, with:

mvn clean package
java -cp target/metadata-extractor-1.0-SNAPSHOT.jar org.company.myapp.App

I got:

Parsed Metadata: 
X-Parsed-By=org.apache.tika.parser.EmptyParser resourceName=image.jpg Content-Length=1557172 Content-Type=image/jpeg
Parsed Text:

What am I doing wrong? How should I build the project so that the parser is autodetected correctly?

Thanks.


Solution

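  • No parser is found on your classpath at runtime, so Tika falls back to EmptyParser. I think the problem is in the Shade plugin configuration: minimizeJar strips classes that your code does not reference directly, and Tika's parsers are only loaded dynamically through service discovery, so they are removed from the shaded jar. Remove this line: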

    <minimizeJar>true</minimizeJar>
    

    Then add these dependencies, each with an appropriate version:

     <dependency>
         <groupId>org.apache.pdfbox</groupId>
         <artifactId>jbig2-imageio</artifactId>
     </dependency>
     <dependency>
         <groupId>com.github.jai-imageio</groupId>
         <artifactId>jai-imageio-core</artifactId>
     </dependency>
     <dependency>
         <groupId>com.github.jai-imageio</groupId>
         <artifactId>jai-imageio-jpeg2000</artifactId>
     </dependency>
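
To check whether the parser classes actually survived shading, a small diagnostic along these lines may help. It is only a sketch: the class name ServiceFileCheck and its two helper methods are made up for illustration and are not part of Tika. It assumes that Tika discovers parsers through a META-INF/services/org.apache.tika.parser.Parser file, reads every such file visible on the classpath, and reports which of the listed classes cannot be loaded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Tika: inspects service registrations on the classpath.
public class ServiceFileCheck {

    // Returns the class names declared in every META-INF/services/<serviceName> file
    // that the class loader can see.
    static List<String> declaredProviders(String serviceName) {
        List<String> providers = new ArrayList<>();
        ClassLoader loader = ServiceFileCheck.class.getClassLoader();
        try {
            Enumeration<URL> files = loader.getResources("META-INF/services/" + serviceName);
            while (files.hasMoreElements()) {
                URL file = files.nextElement();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(file.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Service files allow '#' comments and blank lines.
                        int hash = line.indexOf('#');
                        if (hash >= 0) {
                            line = line.substring(0, hash);
                        }
                        line = line.trim();
                        if (!line.isEmpty()) {
                            providers.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return providers;
    }

    // Returns the subset of classNames that cannot be loaded, e.g. because the
    // classes were stripped from the jar while the service file was kept.
    static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ServiceFileCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only check that the class is present, do not run static init.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: this is the service name Tika uses to register its parsers.
        List<String> declared = declaredProviders("org.apache.tika.parser.Parser");
        List<String> missing = missingClasses(declared);
        System.out.println(declared.size() + " parser(s) declared, " + missing.size() + " not loadable");
        missing.forEach(name -> System.out.println("missing: " + name));
    }
}
```

Run it with the shaded jar on the classpath. If parsers are declared but reported as not loadable, the classes were most likely dropped at build time; if none are declared at all, the service file itself did not make it into the jar.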