r, hadoop, amazon-s3, apache-spark, sparkr

Hadoop configuration in SparkR



I have some problems configuring Hadoop with SparkR in order to read/write data from Amazon S3.
For example, these are the commands that work in PySpark (to solve the same issue):

sc._jsc.hadoopConfiguration().set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "myaccesskey")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "mysecretaccesskey")
sc._jsc.hadoopConfiguration().set("fs.s3n.endpoint", "myentrypoint")

Could anybody help me to work this out?


Solution

  • A solution closer to what you are doing with PySpark can be achieved by using callJMethod (https://github.com/apache/spark/blob/master/R/pkg/R/backend.R#L31)

    > hConf = SparkR:::callJMethod(sc, "hadoopConfiguration")
    > SparkR:::callJMethod(hConf, "set", "a", "b")
    NULL
    > SparkR:::callJMethod(hConf, "get", "a")
    [1] "b"
    

    UPDATE:

    hadoopConfiguration didn't work for me, but conf did; presumably the method name changed at some point. The sketch below applies the same pattern to the S3 settings from the question.
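
    For completeness, here is a rough sketch (not tested against every Spark release) that sets the S3 properties from the question via callJMethod. Note that sparkR.init() and SparkR:::callJMethod are internal/older APIs, so the method name to fetch the configuration ("hadoopConfiguration" vs. "conf") may differ on your version:

        library(SparkR)
        sc <- sparkR.init()

        # Grab the configuration object; try "conf" instead if
        # "hadoopConfiguration" is not exposed on your Spark version.
        hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")

        # Same settings as the PySpark snippet in the question
        SparkR:::callJMethod(hConf, "set", "fs.s3n.impl",
                             "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
        SparkR:::callJMethod(hConf, "set", "fs.s3n.awsAccessKeyId", "myaccesskey")
        SparkR:::callJMethod(hConf, "set", "fs.s3n.awsSecretAccessKey", "mysecretaccesskey")
        SparkR:::callJMethod(hConf, "set", "fs.s3n.endpoint", "myentrypoint")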