Tags: csv, amazon-web-services, amazon-s3, apache-spark, load, data

Loading data from AWS S3 through Apache Spark


I have written Python code to load files from Amazon Web Services (AWS) S3 through Apache Spark. Specifically, the code creates an RDD and loads all CSV files from the directory data in my bucket ruofan-bucket on AWS S3 using SparkContext().wholeTextFiles("s3n://ruofan-bucket/data"). The code is shown below:

import os, sys, inspect

### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]

### Setup the environment variables
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark-1.4.0")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir

### Make the PySpark modules importable
sys.path.append(python_dir)

### Import PySpark
from pyspark import SparkConf, SparkContext

def main():
    ### Initialize the SparkConf and SparkContext
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    sc = SparkContext(conf=conf)

    ### Create an RDD of (filename, contents) pairs from the files in directory "data"
    datafile = sc.wholeTextFiles("s3n://ruofan-bucket/data")    ### Read the data directory from S3 storage.

    ### Collect the (filename, contents) pairs from the RDD
    datafile.collect()


if __name__ == "__main__":
    main()

Before running my code, I've already exported the environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID. But when I run the code, it fails with the following error:

IOError: [Errno 2] No such file or directory: 's3n://ruofan-bucket/data/test1.csv'

I'm sure the directory and the files exist on AWS S3, but I have no idea what is causing the error. I would really appreciate it if anyone could help me solve this problem.


Solution

  • It would appear that wholeTextFiles does not work with Amazon S3.

    However, there may be differences between Hadoop versions, so don't take it as definite. Two possible workarounds are sketched below.
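
One thing worth ruling out first: the s3n connector takes its credentials from the Hadoop configuration, and depending on the Hadoop version it may not pick up the AWS_* environment variables at all. Here is a minimal sketch of passing the exported credentials through explicitly; it relies on the internal _jsc handle, which is a common PySpark workaround rather than a public API:

import os
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf=conf)

### Hand the credentials to the s3n filesystem directly, instead of
### relying on the AWS_* environment variables being picked up.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("fs.s3n.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])

datafile = sc.wholeTextFiles("s3n://ruofan-bucket/data")

If the credentials were the actual problem, this alone may make the original call work.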
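
If wholeTextFiles itself is at fault, you can bypass it entirely: list and read the objects on the driver with boto3, then build the (filename, contents) pairs yourself. This is a sketch under the assumption that the CSV files are small enough to read on the driver; boto3 picks up the same AWS_* environment variables you have already exported:

import boto3
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf=conf)

### Read every object under data/ with boto3, then parallelize the
### (filename, contents) pairs into an RDD, mirroring the shape of
### wholeTextFiles' output.
bucket = boto3.resource("s3").Bucket("ruofan-bucket")
pairs = [(obj.key, obj.get()["Body"].read().decode("utf-8"))
         for obj in bucket.objects.filter(Prefix="data/")
         if not obj.key.endswith("/")]    ### skip the directory marker
datafile = sc.parallelize(pairs)

print(datafile.keys().collect())

Because everything is fetched on the driver, this only suits small inputs; for large datasets you would want a connector that streams from S3 on the executors.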