amazon-web-services amazon-s3 hdf5 aws.s3

How to inspect/read an .h5 file stored remotely in an AWS bucket?

I have decided to read/copy files straight from their online repository to avoid downloading them first. This is my first attempt at this, and my first interaction with aws.s3.

First, just to make sure I could run something simple, I checked whether the bucket existed. I did so with bucket_exists, specifying both the bucket and the region. The bucket does exist.

However, the file I want to inspect is an .h5 file. To work with it, I installed the rhdf5 package via BiocManager. Then, to inspect the file, I did the following:

s3read_using(
     FUN = rhdf5::H5Fopen, 
     bucket = "s3://arpa-e-perform/ERCOT/",
     region = "us-west-2",
     object = "s3://arpa-e-perform/ERCOT/2018/Solar/Actuals/BA_level/BA_solar_actuals_2018.h5")

Unfortunately, it didn't work. The output and the error message I got follow:

List of 6
$ Code     : chr "PermanentRedirect"
$ Message  : chr "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future "| truncated
$ Endpoint : chr "arpa-e-perform.s3.amazonaws.com"
$ Bucket   : chr "arpa-e-perform"
$ RequestId: chr "BGEZ97HJH10KAPRE"
$ HostId   : chr "pxKXcYNLchSYTwEaPLDoFRo11qkWontw+kWAtb8ZqTTEYwTptAkSgl8dbJoI8a2URXIxDCOE7/g="
- attr(*, "headers")=List of 7
..$ x-amz-bucket-region: chr "us-west-2"
..$ x-amz-request-id   : chr "BGEZ97HJH10KAPRE"
..$ x-amz-id-2         : chr "pxKXcYNLchSYTwEaPLDoFRo11qkWontw+kWAtb8ZqTTEYwTptAkSgl8dbJoI8a2URXIxDCOE7/g="
..$ content-type       : chr "application/xml"
..$ transfer-encoding  : chr "chunked"
..$ date               : chr "Mon, 06 Jun 2022 17:35:35 GMT"
..$ server             : chr "AmazonS3"
..- attr(*, "class")= chr [1:2] "insensitive" "list"
- attr(*, "class")= chr "aws_error"
NULL
Error in parse_aws_s3_response(r, Sig, verbose = verbose) :
Moved Permanently (HTTP 301).

Today's been my first interaction with aws.s3 and I'm still going through the manual/forums, so all help will be appreciated. Thank you.

Solution

I think the problem here is that you're not accessing the file at the correct location. The error message says "The bucket you are attempting to access must be addressed using the specified endpoint" and then gives the endpoint as "arpa-e-perform.s3.amazonaws.com", which looks much more like a regular HTTP URL than the s3:// address you passed.
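
To make the mapping from an s3:// address to that endpoint concrete, here is a minimal sketch in base R. The helper name s3_uri_to_https is hypothetical (it is not part of aws.s3 or rhdf5), and it assumes a public bucket reachable through the bucket.s3.amazonaws.com endpoint reported in your error, with an object key that needs no URL encoding.

```r
# Hypothetical helper: turn "s3://bucket/key" into the HTTPS form
# "https://bucket.s3.amazonaws.com/key" suggested by the error message.
# Assumes the key contains no characters that need URL encoding.
s3_uri_to_https <- function(uri) {
  stopifnot(grepl("^s3://[^/]+/.+", uri))
  path   <- sub("^s3://", "", uri)    # drop the scheme
  bucket <- sub("/.*$", "", path)     # everything before the first slash
  key    <- sub("^[^/]+/", "", path)  # everything after the first slash
  paste0("https://", bucket, ".s3.amazonaws.com/", key)
}

url <- s3_uri_to_https(
  "s3://arpa-e-perform/ERCOT/2018/Solar/Actuals/BA_level/BA_solar_actuals_2018.h5"
)
```

The resulting url is the kind of address that can be handed to a reader that speaks plain HTTPS.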

Here's an example of reading the meta dataset from the file using rhdf5.

library(rhdf5)

## Create file access property list for reading from S3
## Credentials are NULL as this is a public bucket
fapl <- H5Pcreate("H5P_FILE_ACCESS")
H5Pset_fapl_ros3(fapl, s3credentials = NULL)

## Open file and the meta dataset
fid <- H5Fopen(name = "https://arpa-e-perform.s3.amazonaws.com/ERCOT/2018/Solar/Actuals/BA_level/BA_solar_actuals_2018.h5", flags = "H5F_ACC_RDONLY", fapl = fapl)
did <- H5Dopen(fid, name = "/meta")

## read the dataset
meta <- H5Dread(did)

## tidy up
H5Dclose(did)
H5Pclose(fapl)
H5Fclose(fid)

## Here's the output
head(meta)
#>          site_ids AC_capacity_MW module_type dc_ac_ratio azimuth latitude
#> 1              BA             BA          BA          BA      BA       BA
#> 2 Adamstown Solar            250           0        1.25     180    33.25
#> 3     Agate Solar             60           0         1.3     180    32.45
#> 4  Angelina Solar            150           0         1.4     180    31.37
#> 5    Angelo Solar            195           2        1.25     180    31.41
#> 6     Angus Solar            113           0        1.25     180    31.69
#>   longitude elevation timezone        country  state     county urban
#> 1        BA        BA       BA             BA     BA         BA    BA
#> 2    -97.26    220.16       -6 bUnited States bTexas    bDenton bNone
#> 3    -97.18    217.84       -6 bUnited States bTexas   bJohnson bNone
#> 4    -94.86        85       -6 bUnited States bTexas  bAngelina bNone
#> 5   -100.58    623.72       -6 bUnited States bTexas bTom Green bNone
#> 6    -97.26    140.72       -6 bUnited States bTexas  bMcLennan bNone
#>   population landcover    gid reV_tech proposed          Zone   ISO
#> 1         BA        BA     BA       BA       BA            BA    BA
#> 2        438       140 690482      bpv Proposed         NORTH ERCOT
#> 3       2105       140 692563      bpv Proposed NORTH CENTRAL ERCOT
#> 4        183        50 744853      bpv Proposed          EAST ERCOT
#> 5         32        30 600558      bpv Proposed          WEST ERCOT
#> 6        715       140 690817      bpv Proposed NORTH CENTRAL ERCOT
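
If you only want to see what the file contains before reading anything, rhdf5's higher-level functions may be enough. The following is a sketch rather than a tested recipe: it assumes your installed rhdf5 version supports the s3 argument of h5ls() and h5read() and was built with the read-only S3 driver, and inspect_remote_h5 is a hypothetical wrapper name of my own.

```r
# Hypothetical wrapper: list a remote HDF5 file's groups/datasets and,
# optionally, read one dataset by name.
inspect_remote_h5 <- function(url, dataset = NULL) {
  # s3 = TRUE asks rhdf5 to use its read-only S3 driver; no credentials are
  # supplied, which assumes the bucket is public
  contents <- rhdf5::h5ls(url, s3 = TRUE)
  data <- if (is.null(dataset)) {
    NULL
  } else {
    rhdf5::h5read(url, name = dataset, s3 = TRUE)
  }
  list(contents = contents, data = data)
}
```

Calling inspect_remote_h5() with the HTTPS URL used above and dataset = "/meta" should return the file listing together with the same meta table, without you having to manage the property list and the file and dataset handles yourself.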