Tags: python, linux, docker, ssl-certificate, nltk

How to download NLTK package with proper security certificates inside docker container?


I have tried every combination suggested here and elsewhere, but I keep getting the same error.

Here is my Dockerfile:

FROM python:3.9

RUN pip install virtualenv && virtualenv venv -p python3
ENV VIRTUAL_ENV=/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

WORKDIR /app
COPY requirements.txt ./
RUN pip install -r requirements.txt

RUN git clone https://github.com/facebookresearch/detectron2.git
RUN python -m pip install -e detectron2

# Install dependencies
RUN apt-get update && apt-get install libgl1 -y
RUN pip install -U nltk
RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]

COPY . /app

# Run the application:
CMD ["python", "-u", "app.py"]

The Docker image builds fine. I pass the platform argument because the image is meant to run on Linux, while the local machine I build on runs Windows, where the detectron library does not install:

>>> docker buildx build --platform=linux/amd64 -t my_app .
[+] Building 23.2s (16/16) FINISHED
 => [internal] load .dockerignore                                                                                  0.0s
 => => transferring context: 2B                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                               0.0s
 => => transferring dockerfile: 634B                                                                               0.0s
 => [internal] load metadata for docker.io/library/python:3.9                                                      0.9s
 => [internal] load build context                                                                                  0.0s
 => => transferring context: 1.85kB                                                                                0.0s
 => [ 1/11] FROM docker.io/library/python:3.9@sha256:6ea9dafc96d7914c5c1d199f1f0195c4e05cf017b10666ca84cb7ce8e269  0.0s
 => CACHED [ 2/11] RUN pip install virtualenv && virtualenv venv -p python3                                        0.0s
 => CACHED [ 3/11] WORKDIR /app                                                                                    0.0s
 => CACHED [ 4/11] COPY requirements.txt ./                                                                        0.0s
 => CACHED [ 5/11] RUN pip install -r requirements.txt                                                             0.0s
 => CACHED [ 6/11] RUN git clone https://github.com/facebookresearch/detectron2.git                                0.0s
 => CACHED [ 7/11] RUN python -m pip install -e detectron2                                                         0.0s
 => CACHED [ 8/11] RUN apt-get update && apt-get install libgl1 -y                                                 0.0s
 => CACHED [ 9/11] RUN pip install -U nltk                                                                         0.0s
 => [10/11] RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]   22.1s
 => [11/11] COPY . /app                                                                                            0.0s
 => exporting to image                                                                                             0.1s
 => => exporting layers                                                                                            0.1s
 => => writing image sha256:83e2495addbc4cdf9b0885e1bb4c5b0fb0777177956eda56950bbf59c095d23b                       0.0s
 => => naming to docker.io/library/my_app

But I keep getting the error below when trying to run the image:

>>> docker run -p 8080:8080 my_app
[nltk_data] Error loading punkt: <urlopen error EOF occurred in
[nltk_data]     violation of protocol (_ssl.c:1129)>
[nltk_data] Error loading punkt: <urlopen error EOF occurred in
[nltk_data]     violation of protocol (_ssl.c:1129)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     EOF occurred in violation of protocol (_ssl.c:1129)>
Traceback (most recent call last):
  File "/app/app.py", line 16, in <module>
    index = VectorstoreIndexCreator().from_loaders(loaders)
  File "/venv/lib/python3.9/site-packages/langchain/indexes/vectorstore.py", line 72, in from_loaders
    docs.extend(loader.load())
  File "/venv/lib/python3.9/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/venv/lib/python3.9/site-packages/langchain/document_loaders/pdf.py", line 37, in _get_elements
    return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 75, in partition_pdf
    return partition_pdf_or_image(
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 137, in partition_pdf_or_image
    return _partition_pdf_with_pdfminer(
  File "/venv/lib/python3.9/site-packages/unstructured/utils.py", line 43, in wrapper
    return func(*args, **kwargs)
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 248, in _partition_pdf_with_pdfminer
    elements = _process_pdfminer_pages(
  File "/venv/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 293, in _process_pdfminer_pages
    _elements = partition_text(text=text)
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text.py", line 89, in partition_text
    elif is_possible_narrative_text(ctext):
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
    if exceeds_cap_ratio(text, threshold=cap_threshold):
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
    if sentence_count(text, 3) > 1:
  File "/venv/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 222, in sentence_count
    sentences = sent_tokenize(text)
  File "/venv/lib/python3.9/site-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
    return _sent_tokenize(text)
  File "/venv/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "/venv/lib/python3.9/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "/venv/lib/python3.9/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "/venv/lib/python3.9/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/venv/nltk_data'
    - '/venv/share/nltk_data'
    - '/venv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
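
One detail in the output above may be worth checking: the Dockerfile passes download_dir='/usr/local/nltk_data', and that directory does not appear in the "Searched in" list. The sketch below only compares the two, using the paths copied from the traceback. The helper name is_searched is made up for illustration and is not part of NLTK.

```python
from pathlib import PurePosixPath

# Directories copied from the "Searched in" list of the LookupError above
# (the trailing empty entry is omitted).
SEARCHED = [
    "/root/nltk_data",
    "/venv/nltk_data",
    "/venv/share/nltk_data",
    "/venv/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
]


def is_searched(download_dir, searched=SEARCHED):
    """Return True if download_dir is one of the listed search directories."""
    target = PurePosixPath(download_dir)
    return any(PurePosixPath(p) == target for p in searched)


# The download_dir used in the Dockerfile versus a directory from the list.
print(is_searched("/usr/local/nltk_data"))
print(is_searched("/usr/local/share/nltk_data"))
```

If the first check is negative, data downloaded at build time would not be found at run time even when the download itself succeeds, so NLTK would try to download again when the container starts.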

Solution

  • After I disconnected my machine from WiFi and connected it to my phone's hotspot, the container ran without any error, because it could then download the NLTK package. An extremely weird (and silly) issue. I wonder whether there is a better solution, since nothing else worked for me.
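
A possible complement to the workaround above, offered as a sketch rather than a confirmed fix: nltk.download returns False instead of raising when a download fails, so a build step that calls it can succeed even when nothing was downloaded. The helper below collects the failures and exits with an error, so the problem shows up at build time. The names fetch_all and main are hypothetical, the default directory is one taken from the "Searched in" list in the question, and the download parameter exists only so the helper can be exercised without NLTK or a network.

```python
import sys


def fetch_all(names, download_dir, download=None):
    """Try to download each resource and return the names that failed."""
    if download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK

        download = nltk.download
    failed = []
    for name in names:
        # nltk.download returns True on success and False on failure.
        if not download(name, download_dir=download_dir):
            failed.append(name)
    return failed


def main(names, download_dir="/usr/local/share/nltk_data"):
    """Exit with a non-zero status if any resource could not be downloaded."""
    failed = fetch_all(names, download_dir)
    if failed:
        sys.exit("NLTK download failed for: " + ", ".join(failed))
```

A build step could call main(["punkt", "averaged_perceptron_tagger"]), the two resources named in the error output. This does not address the SSL error itself: the build still needs a network that can reach the NLTK download server, which is the part the hotspot workaround solved.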