I have this docker-compose.yaml:
version: '3'
services:
  spark-master:
    build:
      context: ./spark-master
    ports:
      - "7077:7077"
      - "8080:8080"
    command: /bin/bash ./spark-master-entrypoint.sh
  spark-worker:
    build:
      context: ./spark-worker
    command: /bin/bash ./spark-worker-entrypoint.sh
    ports:
      - "8081:8081"
      - "7078:7078"
    environment:
      - SPARK_MASTER=spark-master:7077
      - SPARK_WORKER_PORT=7078
  notebook:
    image: jupyter/all-spark-notebook
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark-master
    volumes:
      - ./jupyter-notebook/notebooks:/home/jovyan
      - ./jupyter-notebook/spark/conf:/opt/spark/conf/
    ports:
      - "8888:8888"
    command: start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''
It has 3 containers.
I want to use the notebook (Jupyter Notebook) to interact with Spark.
My notebook:
import os
from pyspark import SparkConf, SparkContext
spark_master = "spark://spark-master:7077"
conf = SparkConf().setAppName("from notebook").setMaster(spark_master)
sc = SparkContext(conf=conf).getOrCreate()
rdd = sc.parallelize([1, 2, 3, 4, 5])
transformed_rdd = rdd.map(lambda x: 3 * x)
result = transformed_rdd.values().collect()
print("Result from Spark:", result)
sc.stop()
but the notebook runs and gets stuck at
result = transformed_rdd.values().collect()
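For reference, the connection to the standalone master can be double-checked from the notebook with standard SparkContext properties (a minimal sketch):

# Quick sanity check from the notebook: did the master accept the application?
print(sc.master)              # expected: spark://spark-master:7077
print(sc.applicationId)       # an app id means the standalone master accepted the application
print(sc.defaultParallelism)  # roughly the total cores granted across registered executors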
FYI
MacBook Pro (15-inch, 2017)
Processor 3.1 GHz Quad-Core Intel Core i7
Memory 16 GB 2133 MHz LPDDR3
Docker version
❯ docker compose version
Docker Compose version v2.20.2-desktop.1
❯ docker version
Client:
Cloud integration: v1.0.35-desktop+001
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:32:30 2023
OS/Arch: darwin/amd64
Context: desktop-linux
Server: Docker Desktop 4.22.1 (118664)
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:35:45 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0
The Spark configuration as seen from the notebook:
for key, value in sc.getConf().getAll():
    print(f'{key}: {value}')
spark.app.id: app-20230922092957-0002
spark.app.startTime: 1695374996363
spark.driver.host: 70f40e7eb82e
spark.executor.id: driver
spark.driver.port: 42001
spark.driver.extraJavaOptions: -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
spark.app.name: from notebook
spark.rdd.compress: True
spark.master: spark://spark-master:7077
spark.serializer.objectStreamReset: 100
spark.submit.pyFiles:
spark.submit.deployMode: client
spark.app.submitTime: 1695374996162
spark.ui.showConsoleProgress: true
spark.executor.extraJavaOptions: -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
What did you try?
Set SPARK_WORKER_PORT=7078 instead of a random port
Changed the service name
Exported the worker port
Limited SPARK_WORKER_MEMORY
Limited SPARK_WORKER_CORES
What were you expecting?
I want my PySpark and Jupyter notebook to run normally.
Thanks, Bernhard Stadler.
When I changed to a 32 GB machine, the run no longer got stuck, but I got an error saying the Python versions did not match.
I added
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y python3.11
RUN mkdir -p /opt/conda/bin
RUN ln -s /usr/bin/python3.11 /opt/conda/bin/python
to my Dockerfile
and added
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
to my notebook file (import sys is needed for sys.executable).
Now I get the result I want.
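For reference, a quick way to double-check that the driver and the executors now run the same Python (a minimal sketch; it only uses sc.parallelize and sys.version_info) is:

import sys

# Driver-side Python version (the notebook's interpreter)
print("driver:", sys.version_info[:3])

# Executor-side Python version, reported from a task running on the cluster
def worker_python_version(_):
    import sys
    return sys.version_info[:3]

print("executor:", sc.parallelize([0], 1).map(worker_python_version).first())

The two tuples should match once PYSPARK_PYTHON points at the right interpreter.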
The screenshot looks like your notebook is being executed, but you have a problem in your code snippet:
result = transformed_rdd.values().collect()
As transformed_rdd doesn't contain tuples, you can't call values(). Just remove the values() call:
result = transformed_rdd.collect()
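For completeness, a minimal corrected version of the notebook cell might look like this (the PYSPARK_PYTHON lines come from the question's update; the rest is the original code with values() removed):

import os
import sys
from pyspark import SparkConf, SparkContext

# Make the executors use the same Python as the driver (from the question's update)
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark_master = "spark://spark-master:7077"
conf = SparkConf().setAppName("from notebook").setMaster(spark_master)
sc = SparkContext.getOrCreate(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])
transformed_rdd = rdd.map(lambda x: 3 * x)

# transformed_rdd holds plain integers, not key/value pairs, so collect() directly
result = transformed_rdd.collect()
print("Result from Spark:", result)  # expected: [3, 6, 9, 12, 15]

sc.stop()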