I have tried all the combinations mentioned here and elsewhere, but I keep getting the same error.
Here is my Dockerfile:
FROM python:3.9
# Create a virtualenv at /venv and put it first on PATH
RUN pip install virtualenv && virtualenv venv -p python3
ENV VIRTUAL_ENV=/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
WORKDIR /app
COPY requirements.txt ./
RUN pip install -r requirements.txt
RUN git clone https://github.com/facebookresearch/detectron2.git
RUN python -m pip install -e detectron2
# Install dependencies
RUN apt-get update && apt-get install libgl1 -y
RUN pip install -U nltk
# Download the punkt tokenizer data at build time
RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]
COPY . /app
# Run the application:
CMD ["python", "-u", "app.py"]
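For reference, NLTK's documentation says it also consults the NLTK_DATA environment variable when building its data search path. A sketch of the download step with that variable exported (just a variant to illustrate; I have not confirmed it resolves this error):

```dockerfile
# Sketch: point NLTK_DATA at the same directory used as download_dir,
# so the runtime search path includes it (NLTK_DATA is documented by NLTK).
ENV NLTK_DATA=/usr/local/nltk_data
RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]
```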
The Docker image builds fine. (I'm passing the platform argument because the image will run on Linux, while my local build machine is Windows and the detectron2 library doesn't install on Windows.)
>>> docker buildx build --platform=linux/amd64 -t my_app .
[+] Building 23.2s (16/16) FINISHED
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 634B 0.0s
=> [internal] load metadata for docker.io/library/python:3.9 0.9s
=> [internal] load build context 0.0s
=> => transferring context: 1.85kB 0.0s
=> [ 1/11] FROM docker.io/library/python:3.9@sha256:6ea9dafc96d7914c5c1d199f1f0195c4e05cf017b10666ca84cb7ce8e269 0.0s
=> CACHED [ 2/11] RUN pip install virtualenv && virtualenv venv -p python3 0.0s
=> CACHED [ 3/11] WORKDIR /app 0.0s
=> CACHED [ 4/11] COPY requirements.txt ./ 0.0s
=> CACHED [ 5/11] RUN pip install -r requirements.txt 0.0s
=> CACHED [ 6/11] RUN git clone https://github.com/facebookresearch/detectron2.git 0.0s
=> CACHED [ 7/11] RUN python -m pip install -e detectron2 0.0s
=> CACHED [ 8/11] RUN apt-get update && apt-get install libgl1 -y 0.0s
=> CACHED [ 9/11] RUN pip install -U nltk 0.0s
=> [10/11] RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ] 22.1s
=> [11/11] COPY . /app 0.0s
=> exporting to image 0.1s
=> => exporting layers 0.1s
=> => writing image sha256:83e2495addbc4cdf9b0885e1bb4c5b0fb0777177956eda56950bbf59c095d23b 0.0s
=> => naming to docker.io/library/my_app
But I keep getting the error below when trying to run the image:
>>> docker run -p 8080:8080 my_app
[nltk_data] Error loading punkt: <urlopen error EOF occurred in
[nltk_data] violation of protocol (_ssl.c:1129)>
[nltk_data] Error loading punkt: <urlopen error EOF occurred in
[nltk_data] violation of protocol (_ssl.c:1129)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data] EOF occurred in violation of protocol (_ssl.c:1129)>
Traceback (most recent call last):
  File "/app/app.py", line 16, in <module>
    index = VectorstoreIndexCreator().from_loaders(loaders)
  File "/venv/lib/python3.9/site-packages/langchain/indexes/vectorstore.py", line 72, in from_loaders
    docs.extend(loader.load())
  File "/venv/lib/python3.9/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/venv/lib/python3.9/site-packages/langchain/document_loaders/pdf.py", line 37, in _get_elements
    return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 75, in partition_pdf
    return partition_pdf_or_image(
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 137, in partition_pdf_or_image
    return _partition_pdf_with_pdfminer(
  File "/venv/lib/python3.9/site-packages/unstructured/utils.py", line 43, in wrapper
    return func(*args, **kwargs)
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 248, in _partition_pdf_with_pdfminer
    elements = _process_pdfminer_pages(
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 293, in _process_pdfminer_pages
    _elements = partition_text(text=text)
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text.py", line 89, in partition_text
    elif is_possible_narrative_text(ctext):
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
    if exceeds_cap_ratio(text, threshold=cap_threshold):
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
    if sentence_count(text, 3) > 1:
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 222, in sentence_count
    sentences = sent_tokenize(text)
  File "/venv/lib/python3.9/site-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
    return _sent_tokenize(text)
  File "/venv/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "/venv/lib/python3.9/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "/venv/lib/python3.9/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "/venv/lib/python3.9/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/venv/nltk_data'
- '/venv/share/nltk_data'
- '/venv/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
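Comparing the Dockerfile with the list of searched paths above, one detail stands out to me: the download_dir I pass at build time never appears among the directories NLTK searches at runtime. A quick sanity check against the list copied from the error output:

```python
# Directories NLTK reports searching at runtime (copied from the error above).
searched = [
    "/root/nltk_data",
    "/venv/nltk_data",
    "/venv/share/nltk_data",
    "/venv/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]

# The directory the Dockerfile downloads punkt into at build time.
download_dir = "/usr/local/nltk_data"

print(download_dir in searched)  # → False: the build-time data is never found
```

So even though the build-time download succeeds, NLTK never looks in that directory and falls back to re-downloading at runtime, which is where the SSL error appears.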
When I disconnected my machine from Wi-Fi and connected it to my phone's hotspot instead, the container ran without any error, since it was then able to download the NLTK data at runtime. An extremely weird (and silly) workaround. Is there a better solution? Nothing else has worked for me.
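Since the failure disappears on a different network, my theory is that something on the Wi-Fi network (proxy, firewall, TLS interception) breaks the handshake to the NLTK download host. A small stdlib-only probe to test that theory (I'm assuming raw.githubusercontent.com here, since that's where nltk_data is hosted; adjust if your NLTK version fetches from elsewhere):

```python
import socket
import ssl

# Host that nltk.download() fetches data from (assumption: the nltk_data
# GitHub mirror; older NLTK versions may use a different host).
HOST = "raw.githubusercontent.com"

def probe_tls(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Attempt a TLS handshake with the given host and report the outcome."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return f"TLS OK ({tls.version()})"
    except (ssl.SSLError, OSError) as exc:
        return f"TLS failed: {exc}"

print(probe_tls(HOST))
```

Running this from inside the container on each network should show whether the "EOF occurred in violation of protocol" error is a network problem rather than an NLTK problem.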