Search code examples
javapython-3.xpostgresqlpysparkconnection

Problem with postgreSQL, trying to connect with PySpark on Jupyter Notebook on Docker


I'm haveing this problem Py4JJavaError: An error occurred while calling o124.save. : org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections. When I run this PySark code on Jupyter notbook , and using docker to run everything, postgreSQL is installed in local machine (Windows).

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, explode
import pyspark.sql.functions as f

spark = SparkSession.builder.appName("ETL Pipeline").config("spark.jars", "./postgresql-42.7.1.jar").getOrCreate()
df = spark.read.text("./Data/WordData.txt")

df2 = df.withColumn("splitedData", f.split("value"," "))
df3 = df2.withColumn("words", explode("splitedData"))
wordsDF = df3.select("words")
wordCount = wordsDF.groupBy("words").count()

driver = "org.postgresql.Driver"
url = "jdbc:postgresql://localhost:5432/local_database"
table = "word_count"
user = "postgres"
password = "12345"

wordCount.write.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("dbtable", table) \
    .option("mode", "append") \
    .option("user", user) \
    .option("password", password) \
    .save()

spark.stop()

I tried to edit postgresql.conf adding "listen_addresses = 'localhost'" and edit pg_hba.conf adding "host all all 0.0.0.0/0 md5" but it did not work for me, so I do not know what to do.


Solution

  • I solve this problem installing PostgreSQL on docker too (with this image https://hub.docker.com/_/postgres/ creating an container only for postgres) and creating a network between the PySpark container and the postgreSQL container with the command

    docker network create my_network,

    this command for the postgres container

    docker run --name postgres_container --network my_network -e POSTGRES_PASSWORD=12345 -d -p 5432:5432 postgres:latest

    And this one for the Jupyter-pyspark container

    docker run --name jupyter_container --network my_network -it -p 8888:8888 -v C:\home\work\path:/home/jovyan/work jupyter/pyspark-notebook:latest