I'm trying to read a .csv file stored in an S3 bucket, and I'm getting errors. I'm following the instructions here, but either it doesn't work or I'm making a mistake, and I can't figure out what I'm doing wrong.
Here's what I'm trying to do:
# I'm working on a SageMaker notebook instance
library(reticulate)
library(tidyverse)
sagemaker <- import('sagemaker')
sagemaker.session <- sagemaker$Session()
region <- sagemaker.session$boto_region_name
bucket <- "my-bucket"
prefix <- "data/staging"
bucket.path <- sprintf("https://s3-%s.amazonaws.com/%s", region, bucket)
role <- sagemaker$get_execution_role()
client <- sagemaker.session$boto_session$client('s3')
key <- sprintf("%s/%s", prefix, 'my_file.csv')
my.obj <- client$get_object(Bucket=bucket, Key=key)
my.df <- read_csv(my.obj$Body) # This is where it all breaks down:
##
## Error: `file` must be a string, raw vector or a connection.
## Traceback:
##
## 1. read_csv(my.obj$Body)
## 2. read_delimited(file, tokenizer, col_names = col_names, col_types = col_types,
## . locale = locale, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, n_max = n_max, guess_max = guess_max,
## . progress = progress)
## 3. col_spec_standardise(data, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, guess_max = guess_max, col_names = col_names,
## . col_types = col_types, tokenizer = tokenizer, locale = locale)
## 4. datasource(file, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment)
## 5. stop("`file` must be a string, raw vector or a connection.",
## . call. = FALSE)
When working with Python, I can read a CSV file using something like this:
import pandas as pd
# ... Lots of boilerplate code
my_data = pd.read_csv(client.get_object(Bucket=bucket, Key=key)['Body'])
This is very similar to what I'm trying to do in R, and it works in Python... so why doesn't it work in R?
Can you point me in the right direction?
Note: Although I could use a Python kernel for this, I'd like to stick to R, because I'm more fluent with it than with Python, at least when it comes to dataframe crunching.
I'd recommend trying the aws.s3 package instead:
https://github.com/cloudyr/aws.s3
Pretty simple - set your env variables:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
"AWS_SECRET_ACCESS_KEY" = "mysecretkey",
"AWS_DEFAULT_REGION" = "us-east-1",
"AWS_SESSION_TOKEN" = "mytoken")
and then once that is out of the way:
aws.s3::s3read_using(read.csv, object = "s3://bucket/folder/data.csv")
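If you'd rather keep readr and the bucket/prefix/key variables from your question, a minimal sketch (the bucket and key names here are just the ones you already defined) could look like this:

library(aws.s3)
library(readr)

bucket <- "my-bucket"
key    <- "data/staging/my_file.csv"

# s3read_using() downloads the object and hands it to whatever reader you pass as FUN
my.df <- s3read_using(
  FUN = read_csv,
  bucket = bucket,
  object = key
)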
Update: I see you're already familiar with boto and are using reticulate, so I'll also leave this easy wrapper for that here: https://github.com/cloudyr/roto.s3
It looks like it has a nice API; for example, here's the variable layout you're aiming to use:
library(roto.s3)  # reticulate-based wrapper around boto3's S3 client
library(readr)

download_file(
  bucket = "is.rud.test",
  key = "mtcars.csv",
  filename = "/tmp/mtcars-again.csv",
  profile_name = "personal"
)
read_csv("/tmp/mtcars-again.csv")