I'm trying to create a cluster on EC2. I have an account setup and validated with AWS. I have successfully downloaded and installed the segue
package and related packages and set my AWS credentials. My problem starts when I try to create a cluster and I get the following:
> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run the setCredentials() function.
> setCredentials('', '') #keys hidden
> myCluster <- createCluster(numInstances=5)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
com.amazonaws.AmazonClientException: Can't turn bucket name into a URI: Illegal character in authority at index 8: https://c:\users\backup~1\appdata\local\temp\rtmp4u0n8yqaaoducils-segue.s3.amazonaws.com
Any ideas?
acesnap, I'm the author of Segue and I can say with confidence that the issue you're running into is that the Segue
package has not been implemented to run on the Windows platform. I'm suspicious that the issue is that windows does funny things with file paths, temp files, and the like. The server side of the Segue package is always the Amazon Elastic Map Reduce service which runs Linux, but temporary files are built on the client machine and so Segue must talk nice with the local operating system.
There are several work-arounds I can think of:
Set up Virtual Box on your local machine and get Ubuntu and R installed.
Set up an EC2 machine and install R and Segue and then use that machine to fire off Segue jobs.
Buy a Mac or install Linux on a desktop machine (kinda obvious, I guess)
Even though my desktop machines are Mac and Linux, I use #2 above frequently. I do this because it speeds up the communication between the machine running Segue and the backend cluster. It also reduces the probability that the Segue main machine will lose connectivity to the EMR backend. This is valuable because if communication is lost between Segue and the amazon cloud while a job is running then the job will run on the cloud cluster, but have no way of returning results to the Segue main machine (the machine you submit jobs from).