apache-spark, pyspark

Using spark2-shell, unable to access an S3 path containing ORC files to create a DataFrame


I have the S3 access_key_id, secret_access_key, and endpoint URL.

I tried the following in spark2-shell:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Read ORC from S3")
  .getOrCreate()

sc.hadoopConfiguration.set("fs.s3a.access.key", "ABC")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "2ju0jzWo/ABC")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "Https://abc")

val df = spark.read.orc("s3a://rcemqe-24-45ae3433-0511-459e-bdaf-7f1348f9d8d0/user/rcem1403/output/mapsig/combine/rcem_map_sccp_lean_min/usecasename=rcem_map_min/finalcubebintime=1650532150/gran=FifteenMinutes/")

I get the WARN messages below, then nothing happens, and it eventually fails saying the path is not found, even though the path exists.

24/04/23 16:08:26 WARN lineage.LineageWriter: Lineage directory /var/log/spark2/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
24/04/23 16:08:27 WARN lineage.LineageWriter: Lineage directory /var/log/spark2/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
24/04/23 16:08:27 WARN fs.FileSystem: S3FileSystem is deprecated and will be removed in future releases. Use NativeS3FileSystem or S3AFileSystem instead.
24/04/23 16:16:29 WARN streaming.FileStreamSink: Error while looking for metadata directory.

Solution

  • Step 1: Set AWS credentials. Any of the following works:

    • the standard AWS environment variables
    • automatic pickup from an EC2 IAM role
    • spark-defaults.conf (see the sketch after this list)
    • in code, as in Step 4 below (warning: don't publish these secrets to any public git repo)
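
    If you go the spark-defaults.conf route, a minimal sketch might look like this; the spark.hadoop. prefix forwards these properties into the Hadoop configuration, and every value here is a placeholder:

    # spark-defaults.conf (placeholder values)
    spark.hadoop.fs.s3a.access.key   ABC
    spark.hadoop.fs.s3a.secret.key   2ju0jzWo/ABC
    spark.hadoop.fs.s3a.endpoint     https://abc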

    Step 2: Download Required JARs:

    Spark needs extra JARs to talk to S3 over the s3a:// scheme. You can download them from the Maven repository; search for hadoop-aws and aws-java-sdk-bundle in versions compatible with the Hadoop version bundled with your Spark distribution. One way to hand them to spark2-shell is shown below.
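
    For example, assuming the two JARs have already been downloaded locally (the paths are placeholders):

    spark2-shell --jars /path/to/hadoop-aws.jar,/path/to/aws-java-sdk-bundle.jar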

    Step 3: Configure Spark session

    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("AWS S3 Example")
      .config("spark.jars", "/path/to/hadoop-aws.jar,/path/to/aws-java-sdk-bundle.jar")
      .getOrCreate()
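
    In a standalone application (unlike spark2-shell, where a session already exists when the shell starts), the credentials from Step 4 can also be folded into the builder via the spark.hadoop. prefix, which copies them into the Hadoop configuration when the session is created. A minimal sketch with placeholder values:

    import org.apache.spark.sql.SparkSession

    // spark.hadoop.* options set before the session is created are copied
    // into the Hadoop configuration used by the s3a connector.
    val spark = SparkSession.builder()
      .appName("AWS S3 Example")
      .config("spark.hadoop.fs.s3a.access.key", "ABC")
      .config("spark.hadoop.fs.s3a.secret.key", "2ju0jzWo/ABC")
      .config("spark.hadoop.fs.s3a.endpoint", "https://abc")
      .getOrCreate()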
    

    Step 4: Set Hadoop Configuration

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    
    hadoopConf.set("fs.s3a.access.key", "ABC")     
    hadoopConf.set("fs.s3a.secret.key", "2ju0jzWo/ABC")     
    hadoopConf.set("fs.s3a.endpoint", "https://abc")
    

    Note: if you are using a non-standard endpoint, you probably also want to set fs.s3a.path.style.access to true.
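
    Continuing with the hadoopConf from Step 4:

    hadoopConf.set("fs.s3a.path.style.access", "true")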

    Step 5: Read Data from S3:

    val df = spark.read.text("s3a://your-bucket-name/path/to/file")
    df.show()
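
    The original question reads ORC rather than plain text; the same pattern works with the ORC reader (bucket and prefix are placeholders):

    // Read ORC files under an S3 prefix into a DataFrame
    val orcDf = spark.read.orc("s3a://your-bucket-name/path/to/orc/")
    orcDf.printSchema()
    orcDf.show(10)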
    

    Ensure you replace placeholders such as "/path/to" and "your-bucket-name/path/to/file" with your actual paths and bucket names.

    If you have problems, look at the s3a documentation, in particular "Troubleshooting S3A", and trust it more than Stack Overflow articles, which are often out of date and/or written by people whose own knowledge is out-of-date Stack Overflow superstition. Failing those docs, the source code for Spark and Hadoop is invaluable.