I'm haveing this problem Py4JJavaError: An error occurred while calling o124.save. : org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
When I run this PySark code on Jupyter notbook , and using docker to run everything, postgreSQL is installed in local machine (Windows).
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, explode
import pyspark.sql.functions as f
spark = SparkSession.builder.appName("ETL Pipeline").config("spark.jars", "./postgresql-42.7.1.jar").getOrCreate()
df = spark.read.text("./Data/WordData.txt")
df2 = df.withColumn("splitedData", f.split("value"," "))
df3 = df2.withColumn("words", explode("splitedData"))
wordsDF = df3.select("words")
wordCount = wordsDF.groupBy("words").count()
driver = "org.postgresql.Driver"
url = "jdbc:postgresql://localhost:5432/local_database"
table = "word_count"
user = "postgres"
password = "12345"
wordCount.write.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("dbtable", table) \
.option("mode", "append") \
.option("user", user) \
.option("password", password) \
.save()
spark.stop()
I tried to edit postgresql.conf adding "listen_addresses = 'localhost'" and edit pg_hba.conf adding "host all all 0.0.0.0/0 md5" but it did not work for me, so I do not know what to do.
I solve this problem installing PostgreSQL on docker too (with this image https://hub.docker.com/_/postgres/ creating an container only for postgres) and creating a network between the PySpark container and the postgreSQL container with the command
docker network create my_network
,
this command for the postgres container
docker run --name postgres_container --network my_network -e POSTGRES_PASSWORD=12345 -d -p 5432:5432 postgres:latest
And this one for the Jupyter-pyspark container
docker run --name jupyter_container --network my_network -it -p 8888:8888 -v C:\home\work\path:/home/jovyan/work jupyter/pyspark-notebook:latest