Tags: python, node.js, docker, kubernetes, scrapy

ScrapyRT Port Unreachable in Kubernetes Docker Container Pod


I'm experiencing difficulties in accessing a ScrapyRT service running on specific ports within a Kubernetes pod. My setup includes a Kubernetes cluster with a pod running a Scrapy application, which uses ScrapyRT to listen for incoming requests on designated ports. These requests are intended to trigger spiders on the corresponding ports.

Despite correctly setting up a Kubernetes service and referencing the Scrapy pod in it, I'm unable to receive any incoming requests to the pod. My understanding is that in Kubernetes networking, a service should be created first, followed by the pod, allowing inter-pod communication and external access through the service. Is this correct?

Below are the relevant configurations:

scrapy-pod Dockerfile:

# Use Ubuntu as the base image
FROM ubuntu:latest

# Avoid prompts from apt
ENV DEBIAN_FRONTEND=noninteractive

# Update package repository and install Python, pip, and other utilities
RUN apt-get update && \
    apt-get install -y curl software-properties-common iputils-ping net-tools dnsutils vim build-essential python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*


# Install nvm (Node Version Manager) - EXPRESS
ENV NVM_DIR /usr/local/nvm
ENV NODE_VERSION 16.20.1

RUN mkdir -p $NVM_DIR
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash

# Install Node.js and npm - EXPRESS
RUN . "$NVM_DIR/nvm.sh" && nvm install $NODE_VERSION && nvm alias default $NODE_VERSION && nvm use default

# Add Node and npm to path so the commands are available - EXPRESS
ENV NODE_PATH $NVM_DIR/versions/node/v$NODE_VERSION/lib/node_modules
ENV PATH $NVM_DIR/versions/node/v$NODE_VERSION/bin:$PATH

# Install Yarn - EXPRESS
RUN npm install --global yarn

# Set the working directory in the container to /usr/src/app
WORKDIR /usr/src/app

# Copy the current directory contents into the container at /usr/src/app
COPY . .

# Install any needed packages specified in requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the start_services.sh script into the container
COPY start_services.sh /start_services.sh

# Make the script executable
RUN chmod +x /start_services.sh


# Install any needed packages specified in package.json using Yarn - EXPRESS
RUN yarn install


# Expose all the necessary ports
EXPOSE 14805 14807 12085 14806 13905 12080 14808 8000


# Define environment variable - EXPRESS
ENV NODE_ENV production

# Run the script when the container starts
CMD ["/start_services.sh"]

start_services.sh:

#!/bin/bash

# Start ScrapyRT instances on different ports
scrapyrt -p 14805 &
scrapyrt -p 14807 &
scrapyrt -p 12085 &
scrapyrt -p 14806 &
scrapyrt -p 13905 &
scrapyrt -p 12080 &
scrapyrt -p 14808 &

# Keep the container running since the ScrapyRT processes are in the background
tail -f /dev/null
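
For reference, each ScrapyRT instance exposes an HTTP API, and a spider run is triggered with a GET request to its /crawl.json endpoint. A test request from inside the pod would look roughly like this (the spider name and target URL are placeholders):

# Trigger a spider on one of the ScrapyRT ports (placeholder spider/URL)
curl "http://localhost:14805/crawl.json?spider_name=myspider&url=https://example.com"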

Service YAML file:

apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - name: port-14805
      protocol: TCP
      port: 14805
      targetPort: 14805
    - name: port-14807
      protocol: TCP
      port: 14807
      targetPort: 14807
    - name: port-12085
      protocol: TCP
      port: 12085
      targetPort: 12085
    - name: port-14806
      protocol: TCP
      port: 14806
      targetPort: 14806
    - name: port-13905
      protocol: TCP
      port: 13905
      targetPort: 13905
    - name: port-12080
      protocol: TCP
      port: 12080
      targetPort: 12080
    - name: port-14808
      protocol: TCP
      port: 14808
      targetPort: 14808
    - name: port-8000
      protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
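
A quick sanity check that this selector actually picks up the pod is to list the service's endpoints; an empty ENDPOINTS column would mean the selector matches no ready pods:

> k get endpoints scrapy-service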

Deployment YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
      - name: scrapy-pod
        image: mydockerhub/privaterepository-scrapy:latest
        imagePullPolicy: Always  
        ports:
        - containerPort: 14805
        - containerPort: 14806
        - containerPort: 14807
        - containerPort: 12085
        - containerPort: 13905
        - containerPort: 12080
        - containerPort: 8000
        envFrom:
        - secretRef:
            name: scrapy-env-secret
        - secretRef:
            name: express-env-secret
      imagePullSecrets:
      - name: my-docker-credentials 
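
Note that 14808 appears in the Dockerfile, the startup script, and the Service, but not in this containerPort list; containerPort entries are informational in Kubernetes, though, so that omission alone shouldn't block traffic. A useful way to test the pod while bypassing the Service entirely is kubectl port-forward; since the forwarding happens inside the pod's network namespace, it can reach even a process bound to 127.0.0.1, so "port-forward works but the Service doesn't" points straight at the bind address:

> k port-forward deploy/scrapy-deployment 14805:14805
# then, from another local terminal (placeholder spider/URL as before):
curl "http://localhost:14805/crawl.json?spider_name=myspider&url=https://example.com"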

scrapy-pod's logs in the PowerShell terminal:

> k logs scrapy-deployment-56b9d66858-p59gs -f
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Site starting on 12080
2024-01-09 21:53:27+0000 [-] Site starting on 14808
2024-01-09 21:53:27+0000 [-] Site starting on 14805
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f4cbdf44d60>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fef9b620a00>
2024-01-09 21:53:27+0000 [-] Site starting on 13905
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 14807
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f0892ff4df0>
2024-01-09 21:53:27+0000 [-] Site starting on 14806
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f00d3b99000>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fba9e321180>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f1782514f10>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 12085
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fb2054cd060>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.

Issue: Despite these configurations, no requests seem to reach the Scrapy pod. Logs from kubectl logs show that ScrapyRT instances start successfully on the specified ports. However, when I send requests from a separate debug pod running a Python Jupyter Notebook, they succeed for other pods but not for the Scrapy pod.

Question: How can I successfully connect to the Scrapy pod? What might be preventing the requests from reaching it?

Any insights or suggestions would be greatly appreciated.

Repair Attempts And Results

Milind's Suggestions

  • Verify that the selector field in the service YAML (scrapy-service) matches the labels in the deployment YAML (scrapy-deployment); the labels must match for the service to select the pods correctly. — Yes, the selector field in the service YAML matches the labels in the deployment YAML:
scrapy-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - protocol: TCP
      port: 14805
      targetPort: 14805
  type: ClusterIP
scrapy-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
      - name: scrapy-pod
...
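
Another way to confirm the match is to list pods using the service's selector; if this returns the scrapy pod, the selector is fine:

> k get po -l app=scrapy-pod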
  • Did you check the logs to see if there are any error messages or indications that the requests are being received? — Yes, I checked the logs, but I get no indication that the requests are being received. Here's the series of steps I run to check this.

Get all the pods:

> k get po
NAME                                         READY   STATUS    RESTARTS   AGE
express-app-deployment-545f899f88-zq58r      1/1     Running   0          2d8h
jupyter-debug-pod                            1/1     Running   0          31h
scrapy-deployment-56b9d66858-wfhpk           1/1     Running   0          31h

Get all the pods and show their IP:

> k get po -o wide
NAME                                         READY   STATUS    RESTARTS   AGE    IP             NODE                   NOMINATED NODE   READINESS GATES
express-app-deployment-545f899f88-zq58r      1/1     Running   0          2d8h   10.244.0.191   pool-6snxmm4o8-xd7ds   <none>           <none>
jupyter-debug-pod                            1/1     Running   0          31h    10.244.1.14    pool-6snxmm4o8-xz05i   <none>           <none>
scrapy-deployment-56b9d66858-wfhpk           1/1     Running   0          31h    10.244.1.96    pool-6snxmm4o8-xz05i   <none>           <none>
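
With the pod IP in hand, one can also bypass the Service entirely and curl the pod directly from the debug pod; if that fails too, the problem lives inside the pod rather than in the Service or its selector:

# curl http://10.244.1.96:14805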

Check the scrapy-deployment logs:

> k logs scrapy-deployment-56b9d66858-wfhpk -f
2024-01-13 23:55:55+0000 [-] Log opened.
2024-01-13 23:55:55+0000 [-] Site starting on 14805
2024-01-13 23:55:55+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f6b6fe04460>
2024-01-13 23:55:55+0000 [-] Running with reactor: AsyncioSelectorReactor.

Check the services:

> k get svc
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
express-backend-service   ClusterIP   10.245.59.90     <none>        80/TCP      9d
scrapy-service            ClusterIP   10.245.129.89    <none>        14805/TCP   31h

In a separate terminal, I exec into the jupyter-debug-pod:

> k exec -it scrapy-deployment-56b9d66858-wfhpk -- /bin/bash
root@scrapy-deployment-56b9d66858-wfhpk:/usr/src/app#

nslookup scrapy-service:

# nslookup scrapy-service
Server:         10.245.0.10
Address:        10.245.0.10#53

Name:   scrapy-service.default.svc.cluster.local
Address: 10.245.129.89

So it does resolve scrapy-service, and the 10.245.0.10 answering is the cluster's DNS server, an IP I hadn't seen mentioned anywhere previously.

When I curl express-backend-service, it works as expected:

# curl express-backend-service
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>HTTP Video Stream</title>
  </head>
  <body>
    <video id="videoPlayer" width="650" controls muted="muted" autoplay>
      <source src="/video/play" type="video/mp4" />
    </video>
  </body>
</html>

But when I curl scrapy-service, it just hangs and then fails (curl defaults to port 80, which this service doesn't define at all, so a failure here is expected either way):

# curl scrapy-service
curl: (28) Failed to connect to scrapy-service port 80 after 130552 ms: Connection timed out

Even when I add the 14805 port, it still fails:

# curl scrapy-service:14805
curl: (7) Failed to connect to scrapy-service port 14805 after 6 ms: Connection refused

Notably, this fails almost instantly with "Connection refused" rather than timing out, which in hindsight means the request did reach the pod's network stack, but nothing was listening on that interface.
  • Did you verify that DNS resolution is working within the cluster and that the name (scrapy-service) can be resolved?

Yes, the scrapy-service is successfully resolving to an internal cluster IP address (10.245.129.89).

  • Did you verify whether any firewall rules might be blocking traffic between pods within the cluster?

I checked my DigitalOcean control panel's firewall settings and saw that all ports were covered under Outbound Rules. However, I noticed that I had nothing set up under Inbound Rules. Perhaps this was the issue? I immediately set up rules allowing all ports on all IPv4/IPv6 sources for TCP, UDP, and ICMP. However, after making the changes, deleting the service and deployment, and then recreating both from scratch, the issue remained.

  • Did you try ping or telnet to check connectivity between the pods and the cluster?

Yes, I tried that; it failed too.

Here's the result of telnet (note the service IP is now 10.245.24.22, since I had deleted and recreated the service in the meantime):

# telnet scrapy-service.default.svc.cluster.local
Trying 10.245.24.22...
telnet: Unable to connect to remote host: Connection timed out
root@jupyter-debug-pod:/# telnet scrapy-service.default.svc.cluster.local 14805
Trying 10.245.24.22...
telnet: Unable to connect to remote host: Connection refused

Here's the result of ping (a ClusterIP is virtual and generally doesn't answer ICMP, so the packet loss here is expected and not diagnostic by itself):

# ping scrapy-service.default.svc.cluster.local
PING scrapy-service.default.svc.cluster.local (10.245.24.22) 56(84) bytes of data.

--- scrapy-service.default.svc.cluster.local ping statistics ---
1295 packets transmitted, 0 received, 100% packet loss, time 1325042ms
  • I can see that the scrapy-service is of type ClusterIP, which means it's an internal service. This won't work if you need external access. Please double-check it, and try changing it to NodePort or LoadBalancer to gain external access.

Ok, I changed scrapy-service.yaml to NodePort like so:

apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - protocol: TCP
      port: 14805
      targetPort: 14805
  type: NodePort
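
For the record, a NodePort service is reached from outside the cluster via a node's IP plus an allocated port in the 30000-32767 range, which kubectl get svc shows in the PORT(S) column (e.g. 14805:3XXXX/TCP); curling the bare service name from inside the cluster still goes through the ClusterIP on port 80:

> k get svc scrapy-service
# from outside the cluster (node IP and node port are placeholders):
curl http://<node-ip>:<node-port>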

Afterwards (after deleting and recreating the service), I tried curl scrapy-service again:

# curl scrapy-service
curl: (28) Failed to connect to scrapy-service port 80 after 129976 ms: Connection timed out

This too failed, which makes sense in hindsight: changing the service type changes how traffic gets into the cluster, not anything inside the pod.

  • Lastly, verify that the pod is running.
> k logs scrapy-deployment-56b9d66858-6xs9r -f
2024-01-15 07:33:04+0000 [-] Log opened.
2024-01-15 07:33:04+0000 [-] Site starting on 14805
2024-01-15 07:33:04+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f51f08fce20>
2024-01-15 07:33:04+0000 [-] Running with reactor: AsyncioSelectorReactor.

As you can see above, the pod is running and producing logs.

So now you can see my frustration after more than a week of being unable to solve this. There is another pod, express-app-deployment-545f899f88-zq58r, which does NOT behave like this: it runs an Express.js app on port 8000, and its service, express-backend-service, works as expected.


Solution

  • After struggling with this issue for two weeks, I finally found the solution!

    The problem was my ScrapyRT service in a Kubernetes pod was not accessible from other pods within the cluster. Despite correct service and deployment configurations, all attempts to connect to the ScrapyRT service were failing.

    The solution was to modify the command used to start ScrapyRT in the Dockerfile. By adding the argument -i 0.0.0.0, I instructed ScrapyRT to listen on all network interfaces, making it reachable from other pods in the cluster.

    Updated Dockerfile Command:

    # Start a single ScrapyRT instance on the exposed port
    CMD ["scrapyrt", "-p", "14805", "-i", "0.0.0.0"]
    

    Prior Dockerfile Command (which fails because it is missing "-i", "0.0.0.0"):

    # Start a single ScrapyRT instance on the exposed port
    CMD ["scrapyrt", "-p", "14805"]
    

    Explanation:

    • The -p flag specifies the port on which ScrapyRT should listen (14805 in my case).
    • The -i 0.0.0.0 argument tells ScrapyRT to listen on all available network interfaces inside the container, not just localhost. This change is crucial for a service inside a Docker container to be accessible from outside the container, especially in a Kubernetes environment.
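
    To verify the binding (or to diagnose it in the first place), inspect the listening sockets inside the container; net-tools is already installed in the image. ScrapyRT bound to localhost shows up as 127.0.0.1:14805, while the fixed version shows 0.0.0.0:14805 (the pod name is a placeholder):

    > k exec -it <scrapy-pod-name> -- netstat -tlnp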

    After making that one change to the Dockerfile, redeploying the image to the cluster, and recreating the service, the pod successfully began receiving requests from outside, since ScrapyRT was no longer bound only to localhost.
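
    For completeness, if you still want one ScrapyRT instance per port as in the original start_services.sh, the same flag applies to each instance; a sketch using the original ports:

    #!/bin/bash

    # Start ScrapyRT instances on different ports, bound to all interfaces
    scrapyrt -i 0.0.0.0 -p 14805 &
    scrapyrt -i 0.0.0.0 -p 14807 &
    scrapyrt -i 0.0.0.0 -p 12085 &
    scrapyrt -i 0.0.0.0 -p 14806 &
    scrapyrt -i 0.0.0.0 -p 13905 &
    scrapyrt -i 0.0.0.0 -p 12080 &
    scrapyrt -i 0.0.0.0 -p 14808 &

    # Keep the container running since the ScrapyRT processes are in the background
    tail -f /dev/null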