Tags: docker, apache-spark, pyspark, amazon-ecs

Spark application running inside a Docker container: I see only the driver, but no worker nodes


I have created a Dockerfile with Spark and run my Spark jobs inside the container for testing:

FROM python:3.10.9-buster

###########################################
# Upgrade the packages
###########################################
# Download latest listing of available packages:
RUN apt-get -y update
# Upgrade already installed packages:
RUN apt-get -y upgrade
# Install a new package:

###########################################
# install tree package
###########################################
# Install a new package:
RUN apt-get -y install tree


#############################################
# install pipenv
############################################
ENV PIPENV_VENV_IN_PROJECT=1

# PIPENV_VENV_IN_PROJECT=1 is important: it causes the resulting virtual environment
# to be created as /app/.venv. Without this the environment gets created somewhere
# surprising, such as /root/.local/share/virtualenvs/app-4PlAip0Q - which makes it
# much harder to write automation scripts later on.

RUN python -m pip install --upgrade pip

RUN pip install --no-cache-dir pipenv

RUN pip install --no-cache-dir jupyter

RUN pip install --no-cache-dir py4j

RUN pip install --no-cache-dir findspark


#############################################
# install java and spark and hadoop
# Java is required for scala and scala is required for Spark
############################################


# VERSIONS
ENV SPARK_VERSION=3.2.4 \
HADOOP_VERSION=3.2 \
JAVA_VERSION=11

RUN apt-get update --yes && \
    apt-get install --yes --no-install-recommends \
    "openjdk-${JAVA_VERSION}-jre-headless" \
    ca-certificates-java  \
    curl && \
    apt-get clean && rm -rf /var/lib/apt/lists/*


RUN java --version

# DOWNLOAD SPARK AND INSTALL
RUN DOWNLOAD_URL_SPARK="https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    && wget --verbose -O apache-spark.tgz  "${DOWNLOAD_URL_SPARK}"\
    && mkdir -p /home/spark \
    && tar -xf apache-spark.tgz -C /home/spark --strip-components=1 \
    && rm apache-spark.tgz


# SET SPARK ENV VARIABLES
ENV SPARK_HOME="/home/spark"
ENV PATH="${SPARK_HOME}/bin/:${PATH}"

# Fix Spark installation for Java 11 and Apache Arrow library
# see: https://github.com/apache/spark/pull/27356, https://spark.apache.org/docs/latest/#downloading
RUN cp -p "${SPARK_HOME}/conf/spark-defaults.conf.template" "${SPARK_HOME}/conf/spark-defaults.conf" && \
    echo 'spark.driver.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
    echo 'spark.executor.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true' >> "${SPARK_HOME}/conf/spark-defaults.conf"

############################################
# create group and user
############################################

ARG UNAME=simha
ARG UID=1000
ARG GID=1000


RUN cat /etc/passwd

# create group
RUN groupadd -g $GID $UNAME

# create a user with uid 1000 and gid 1000
RUN useradd -u $UID -g $GID -m -s /bin/bash $UNAME
# -m creates home directory

# change ownership of /home/simha to 1000:1000
RUN chown $UID:$GID /home/simha


###########################################
# add sudo
###########################################

RUN apt-get update --yes 
RUN apt-get -y install sudo
RUN apt-get -y install vim
RUN cat /etc/sudoers
RUN echo "$UNAME ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
RUN cat /etc/sudoers

#############################
# spark history server
############################

# allow the Spark history server (mount a local spark_events folder to /home/simha/app/spark_events)

RUN echo 'spark.eventLog.enabled true' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
    echo 'spark.eventLog.dir file:///home/simha/app/spark_events' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
    echo 'spark.history.fs.logDirectory file:///home/simha/app/spark_events' >> "${SPARK_HOME}/conf/spark-defaults.conf"

RUN mkdir /home/spark/logs
RUN chown $UID:$GID /home/spark/logs
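
# Note: the history server itself is not started by this image; once the
# spark_events directory exists (e.g. via the volume mount used at run time)
# it can be launched with "${SPARK_HOME}/sbin/start-history-server.sh".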

###########################################
# change working dir and user
###########################################

USER $UNAME

RUN mkdir -p /home/$UNAME/app
WORKDIR /home/$UNAME/app
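
The image tag used by the docker run command below is assumed to come from a build step along these lines (the Dockerfile above is assumed to sit in the current directory):

docker build -t python_spark_custom_build:latest .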

I then step into the Docker container:

hostfolder="$(pwd)"
dockerfolder="/home/simha/app"
docker run --rm -it \
  --net="host" \
  -v "${hostfolder}:${dockerfolder}" \
  python_spark_custom_build:latest /bin/bash

Inside the container I start the PySpark shell.

(screenshot: the PySpark shell started inside the container)

So everything is running inside a container.

I check the web UI to look at the executors.

(screenshot: the Spark web UI Executors tab, showing only the driver)

Q1: I see there is only a driver and no worker node. Can the driver act as a worker as well?

Q2: How do I create a cluster within my container? I want a setup of 1 driver and 4 worker nodes in this container, so that parallelization can be achieved.

I am planning to use ECS tasks to run my Spark scripts in Docker containers. I don't want to use EMR or Glue.

I am fine with having one node (acting as both driver and worker) as long as multiple executors are running, so that parallelization is achieved.

My understanding is that the driver and executors are the core of parallelization, irrespective of whether they run on separate nodes or all together on one node.
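
For reference, with everything in one JVM the usual way to get that parallelism is Spark's local mode, where the driver itself runs tasks on N threads; a minimal sketch, assuming 4 threads are wanted and a hypothetical script name my_job.py:

# interactive shell with 4 local task threads
pyspark --master "local[4]"

# the same for a batch job (my_job.py is a placeholder name)
spark-submit --master "local[4]" my_job.py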


Solution

  • A Single Node cluster is a cluster consisting of an Apache Spark driver and no Spark workers, as per the Databricks documentation: https://docs.databricks.com/clusters/single-node.html

    If you wish to create a multi-node cluster, say 1 master and 4 workers, you can refer to this Medium article: https://medium.com/@MarinAgli1/setting-up-a-spark-standalone-cluster-on-docker-in-layman-terms-8cbdc9fdd14b (a rough sketch of the standalone scripts is shown below).
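
If you do want a master and workers inside this single container rather than local mode, a rough sketch using the standalone scripts shipped with the Spark install above (SPARK_HOME=/home/spark in the image; the ports, core and memory values are illustrative, not taken from the question):

# start the standalone master (its web UI listens on port 8080 by default)
"${SPARK_HOME}/sbin/start-master.sh"

# start a worker process and register it with the master on the default port 7077;
# running several worker processes on the same host needs SPARK_WORKER_INSTANCES
# or distinct --port/--webui-port values per worker
"${SPARK_HOME}/sbin/start-worker.sh" spark://$(hostname):7077 --cores 2 --memory 2g

# attach the shell to the standalone master instead of the default local mode
pyspark --master spark://$(hostname):7077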