Tags: json, apache-spark, spark-notebook

What are SparkSession Config Options


I am trying to use SparkSession to convert the JSON data in a file to an RDD with Spark Notebook. I already have the JSON file.

val spark = SparkSession
  .builder()
  .appName("jsonReaderApp")
  .config("config.key.here", configValueHere)
  .enableHiveSupport()
  .getOrCreate()
val jread = spark.read.json("search-results1.json")

I am very new to Spark and do not know what to use for config.key.here and configValueHere.


Solution

  • SparkSession

    To get all the "various Spark parameters as key-value pairs" for a SparkSession (the "entry point to programming Spark with the Dataset and DataFrame API"), run the following (this uses the Spark Python API; Scala would be very similar).

    import pyspark
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    # Create (or reuse) a session, then list the configuration key-value pairs;
    # a fresh SparkConf() is populated from the spark.* JVM system properties
    spark = SparkSession.builder.getOrCreate()
    SparkConf().getAll()
    

    or without importing SparkConf:

    spark.sparkContext.getConf().getAll()
    

    Depending on which API you are using, see one of the following:

    1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
    2. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
    3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
    4. https://spark.apache.org/docs/latest/api/R/reference/sparkR.session.html

    You can get a lower-level list of SparkSession configuration options by running the code below. Most entries are the same, but there are a few extra ones. I am not sure whether you can change these.

    spark.sparkContext._conf.getAll()  
    
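    If you only need a single option, or want to change one that is mutable at runtime, you can read and write it on an existing session with spark.conf.get and spark.conf.set (a small sketch; spark.sql.shuffle.partitions is just an illustrative key):

    spark.conf.get("spark.sql.shuffle.partitions")        # read one option for this session
    spark.conf.set("spark.sql.shuffle.partitions", "50")  # change a mutable option at runtime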

    SparkContext

    To get all the "various Spark parameters as key-value pairs" for a SparkContext (the "Main entry point for Spark functionality," i.e. the "connection to a Spark cluster" used "to create RDDs, accumulators and broadcast variables on that cluster"), run the following.

    import pyspark
    from pyspark import SparkConf, SparkContext
    # Build a conf, create the context, then read the parameters back from the context
    spark_conf = SparkConf().setAppName("test")
    sc = SparkContext(conf=spark_conf)
    sc.getConf().getAll()
    

    Depending on which API you are using, see one of the following:

    1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
    2. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
    3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
    4. https://spark.apache.org/docs/latest/api/R/reference/sparkR.init-deprecated.html

    Spark parameters

    You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:

    [(u'spark.eventLog.enabled', u'true'),
     (u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
     ...
     ...
     (u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
    
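    If the raw list is hard to scan, you can sort it before printing (a small convenience sketch, assuming a SparkSession named spark; getAll() just returns a list of (key, value) tuples):

    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        print(key, "=", value)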

    Depending on which API you are using, see one of the following:

    1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
    2. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkConf.html
    3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
    4. https://spark.apache.org/docs/3.3.2/api/R/reference/sparkR.conf.html (for SparkR, sparkConfig can only be set from sparkR.session(sparkConfig=list()))

    For a complete list of Spark properties, see:
    http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
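
    For the SQL-related properties in particular, a running session can also list them, with descriptions and default values, using the SET -v SQL command (a short sketch):

    spark.sql("SET -v").show(truncate=False)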

    Setting Spark parameters

    Each tuple is ("spark.some.config.option", "some-value"), which you can set in your application with:

    SparkSession

    spark = (
        SparkSession
        .builder
        .appName("Your App Name")
        .config("spark.some.config.option1", "some-value")
        .config("spark.some.config.option2", "some-value")
        .getOrCreate())
    
    sc = spark.sparkContext
    
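    To confirm that a builder option actually took effect, you can read it back from the session's runtime config (a quick check, assuming the session created above):

    spark.conf.get("spark.some.config.option1")  # returns 'some-value'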

    SparkContext

    spark_conf = (
        SparkConf()
        .setAppName("Your App Name")
        .set("spark.some.config.option1", "some-value")
        .set("spark.some.config.option2", "some-value"))
    
    sc = SparkContext(conf=spark_conf)
    

    spark-defaults

    You can also set the Spark parameters in a spark-defaults.conf file:

    spark.some.config.option1 some-value
    spark.some.config.option2 "some-value"
    

    then run your Spark application with spark-submit (a PySpark example; for a Scala/Java jar you would also pass --class):

    spark-submit \
    --properties-file path/to/your/spark-defaults.conf \
    --name "Your App Name" \
    --py-files path/to/your/supporting/pyspark_files.zip \
    path/to/your/pyspark_main.py
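
    You can also pass individual properties on the command line with --conf. Precedence goes: properties set directly on the SparkConf or SparkSession builder override flags passed to spark-submit, which in turn override values in spark-defaults.conf. A minimal sketch:

    spark-submit \
    --conf spark.some.config.option1=some-value \
    --name "Your App Name" \
    path/to/your/pyspark_main.py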