apache-spark, pyspark

Spark session behaviour using getOrCreate()


According to the Spark documentation (https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#getOrCreate--), the getOrCreate method "gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder."

Let's assume we have two Spark applications running in parallel, where application-2 is started some time after application-1.

Run application-1:

from pyspark.sql import SparkSession

# Create a SparkSession-1 and run
spark = (
    SparkSession.builder
    .appName("my-application-1")
    .config("spark.sql.shuffle.partitions", 20)
    .getOrCreate()
)

Run application-2 (started while application-1 is still running):

from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("my-application-2")
    .config("spark.sql.shuffle.partitions", 30)
    .getOrCreate()
)

Query:

  • (1) Does application-2 reuse the Spark session already created by application-1?
  • (2) At times I have seen Spark log "WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect". If "my-application-2" reuses the existing Spark session created by "my-application-1", what happens to the shuffle partitions? Is the setting updated from 20 to 30, or is the value of 30 ignored and the existing value of 20 kept?

Solution

  • Session reuse can only happen within a single driver process, for example when getOrCreate is called more than once in the same spark-shell or notebook kernel.

    With spark-submit, each application runs in its own driver process, so you always get a new session; application-2 cannot reuse application-1's session.