I have written Python code to load files from Amazon Web Services (AWS) S3 through Apache Spark. Specifically, the code creates an RDD and loads all CSV files from the directory data in my bucket ruofan-bucket on AWS S3 using SparkContext().wholeTextFiles("s3n://ruofan-bucket/data"). The code is shown below:
import os, sys, inspect

### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]

### Setup the environment variables
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark-1.4.0")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir

### Setup pyspark directory path
pyspark_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
sys.path.append(pyspark_dir)  ### Setting PYTHONPATH above does not affect the running interpreter.

### Import the pyspark
from pyspark import SparkConf, SparkContext

def main():
    ### Initialize the SparkConf and SparkContext
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    sc = SparkContext(conf = conf)

    ### Create a RDD containing metadata about files in directory "data"
    datafile = sc.wholeTextFiles("s3n://ruofan-bucket/data")  ### Read data directory from S3 storage.

    ### Collect files from the RDD
    datafile.collect()

if __name__ == "__main__":
    main()
Before running my code, I exported the environment variables: AWS_SECRET_ACCESS_KEY. But when I run the code, it fails with the following error:
IOError: [Errno 2] No such file or directory: 's3n://ruofan-bucket/data/test1.csv'
I'm sure that both the directory and the files exist on AWS S3, and I don't understand what causes the error. I would really appreciate any help in solving the problem.
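Since the credentials are passed through exported environment variables, one thing that can be ruled out quickly is whether they are actually visible to the Python process that starts Spark. The following is only a minimal sketch: AWS_SECRET_ACCESS_KEY is the variable named above, AWS_ACCESS_KEY_ID is assumed to be its companion, and missing_aws_vars is a hypothetical helper name, not part of pyspark.

```python
import os

# AWS_SECRET_ACCESS_KEY is named in the question; AWS_ACCESS_KEY_ID is an
# assumed companion variable (not stated in the question).
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the names of the required AWS variables that are unset or empty.

    `env` is any mapping of variable names to values; it defaults to the
    environment of the current process.
    """
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]
```

Calling missing_aws_vars() at the top of main() and printing the result shows whether the export reached the process. An empty list only means the variables are set; it does not prove that the keys are valid or that they grant access to the bucket.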
It would appear that wholeTextFiles does not work with Amazon S3.
However, there may be differences between Hadoop versions, so don't take it as definite.
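If wholeTextFiles does fail in your setup, one possible workaround is to read each file with sc.textFile and rebuild the (path, content) pairs yourself. This is a rough sketch rather than a verified fix: read_whole_files is a hypothetical helper, it assumes that sc.textFile can read the s3n:// paths on your Hadoop version (not tested here), and it assumes you list the file paths explicitly, for example the test1.csv path from the error message.

```python
def read_whole_files(sc, paths):
    """Return a list of (path, content) pairs, similar in shape to what
    wholeTextFiles(...).collect() yields.

    `sc` is an existing SparkContext; `paths` is an iterable of file URIs.
    Each file is read with sc.textFile, its lines are collected on the
    driver, and they are joined back together with newlines.
    """
    pairs = []
    for path in paths:
        lines = sc.textFile(path).collect()  # one list element per line
        pairs.append((path, "\n".join(lines)))
    return pairs
```

Inside main() this could be called as read_whole_files(sc, ["s3n://ruofan-bucket/data/test1.csv"]). Note that every file is collected on the driver, so this only suits small files, and the original line terminators (including a trailing newline) are not preserved.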