
Getting an error while trying to read an Athena table in Spark


I have the following code snippet in pyspark:

import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe

def validate_data():
    conf = SparkConf().setAppName("app")
    spark = SparkContext(conf=conf)
    config = {
    "val_path" : "s3://forecasting/data/validation.csv"
    }

    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())


if __name__ == "__main__":
    validate_data()

This code works fine when run from a Jupyter notebook on SageMaker (connecting to EMR), but when we run it as a Python script from the terminal, it throws this error:

Error message

AttributeError: 'SparkContext' object has no attribute 'read'

We have to automate these notebooks, so we are trying to convert them to Python scripts.


Solution

  • You can only call read on a SparkSession, not on a SparkContext. In your snippet the variable named spark is actually a SparkContext, which is why the attribute lookup fails. Build a session instead:

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
    
    conf = SparkConf().setAppName("app")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    

    Or you can wrap an existing SparkContext in a SparkSession:

    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    spark = SparkSession(sc)
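
    Putting this together, the full script from the question could look like the sketch below. It keeps the table names from the question; note that enableHiveSupport() is typically required for a standalone script to resolve catalog tables (e.g. when the EMR cluster uses the Glue Data Catalog as its metastore) — whether you need it depends on your cluster configuration.

    ```python
    from pyspark import SparkConf
    from pyspark.sql import SparkSession


    def validate_data():
        conf = SparkConf().setAppName("app")
        # Build a SparkSession rather than a SparkContext; only the
        # session exposes the .read attribute used below.
        spark = (
            SparkSession.builder
            .config(conf=conf)
            .enableHiveSupport()  # lets spark.read.table() resolve catalog tables
            .getOrCreate()
        )

        data1_df = spark.read.table("db1.data_dest")
        data2_df = spark.read.table("db2.data_source")
        print(data1_df.count())
        print(data2_df.count())


    if __name__ == "__main__":
        validate_data()
    ```

    Submit it on the cluster with spark-submit (e.g. `spark-submit validate.py`) so the script picks up the same cluster configuration the notebook was using.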