
Docker python tika


I would like to create a Dockerfile that installs all the components needed to run python-tika inside a Docker container.

So far this is my Dockerfile:

### Get Python
FROM python:3

RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas

RUN mkdir scripts

ADD runner.py /scripts/

CMD [ "python", "./scripts/runner.py" ]

I build the image and run it:

docker build -t docker-tika .

docker run docker-tika

But it complains with the following error:

[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 13:49:52,528 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 13:50:09,742 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 13:50:10,133 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 13:50:10,134 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
2020-05-08 13:50:10,271 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 13:50:10,271 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.

The runner.py script is as below:

import tika
tika.initVM()

There are two things I am unsure about:

1. I read that the tika-server jar needs to be downloaded.
2. The call to initVM() inside the Python script starts the tika-server in the background.

I don't know what I'm missing in the Dockerfile. I'd appreciate any help!

I have updated the Dockerfile to install Java as well, and it is still complaining about Java:

### 1. Get Linux
FROM alpine:3.7

### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre

ENV JAVA_HOME=/opt/java/openjdk \
    PATH="/opt/java/openjdk/bin:$PATH"

### 3. Get Python
FROM python:3

RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas

RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output

ADD runner2.py /scripts/
ADD sample.pdf .

CMD [ "python", "./scripts/runner2.py" ]

cat runner2.py:

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])

[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika

2020-05-08 14:40:23,183 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 14:41:00,480 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 14:41:02,324 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 14:41:02,324 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.

Solution

  • I don't have enough reputation to comment, so I'm posting here.

    It seems that your Dockerfile is now doing a multi-stage build, because it contains two FROM lines. Only the last stage (FROM python:3) ends up in the final image, so the Java installed in the earlier Alpine stage is discarded.

    As Giga Kokaia and others stated earlier, Java is the problem. It seems that you want to do this with a single Dockerfile. That can be achieved, for example, by keeping Alpine as the base image, but you will need some additional dependencies to be able to install Python and the required pip packages. Alpine might not be the best base for Python when many libraries are used, because it uses musl rather than glibc, so packages such as numpy and pandas may have to be compiled from source. However, here is a very roughly updated Dockerfile:

    ### 1. Get Linux
    FROM alpine:3.7
    
    ### 2. Get Java via the package manager
    RUN apk update \
    && apk upgrade \
    && apk add --no-cache bash \
    && apk add --no-cache --virtual=build-dependencies unzip \
    && apk add --no-cache curl \
    && apk add --no-cache openjdk8-jre \
    && apk add python3 python3-dev gcc g++ gfortran musl-dev libxml2-dev libxslt-dev
    
    ENV JAVA_HOME=/opt/java/openjdk \
        PATH="/opt/java/openjdk/bin:$PATH"
    
    
    RUN pip3 install --upgrade pip requests
    RUN pip3 install python-docx wheel tika numpy 
    RUN pip3 install pandas
    
    RUN mkdir scripts
    RUN mkdir pdfs
    RUN mkdir output
    
    ADD runner2.py /scripts/
    ADD sample.pdf .
    
    CMD [ "python3", "./scripts/runner2.py"  ]
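
To confirm that Java really is visible in the final image before tika tries to start its server, a small check along these lines could be run inside the container first. This is only a sketch: the helper name java_version is made up for illustration and is not part of the tika package. It uses only the Python standard library.

```python
import shutil
import subprocess
from typing import Optional


def java_version(executable: str = "java") -> Optional[str]:
    """Return the first line printed by `<executable> -version`,
    or None if the executable cannot be found or run.

    Hypothetical helper for illustration; not part of python-tika.
    """
    # Look the executable up on PATH, as a subprocess call would.
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else path
```

Calling java_version() at the top of runner2.py and printing the result would show either a version banner or None. If it returns None, Java is missing from the final image or is not on PATH, which matches the "Unable to run java; is it installed?" error above.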