
How to create a directory in HDFS on Google Cloud Platform via Java API


I am running a Hadoop cluster on Google Cloud Platform, using Google Cloud Storage as the backend for persistent data. I am able to ssh into the master node from a remote machine and run hadoop fs commands. However, when I try to execute the following code I get a timeout error.

Code

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem hdfs = FileSystem.get(new URI("hdfs://mymasternodeip:8020"), new Configuration());

// Print the home directory
Path homeDir = hdfs.getHomeDirectory();
System.out.println("Home folder: " + homeDir);

// Create a directory
Path workingDir = hdfs.getWorkingDirectory();
Path newFolderPath = new Path("/DemoFolder");

newFolderPath = Path.mergePaths(workingDir, newFolderPath);
if (hdfs.exists(newFolderPath)) {
    hdfs.delete(newFolderPath, true); // Delete the existing directory
}
// Create the new directory
hdfs.mkdirs(newFolderPath);

The timeout occurs when the hdfs.exists() call is executed.

Error

org.apache.hadoop.net.ConnectTimeoutException: Call From gl051-win7/192.xxx.1.xxx to 111.222.333.444.bc.googleusercontent.com:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=111.222.333.444.bc.googleusercontent.com/111.222.333.444:8020]

Are you aware of any limitations when using the Java Hadoop APIs against Hadoop on Google Cloud Platform?

Thanks!


Solution

  • It looks like you're running that code on your local machine and trying to connect to the Google Compute Engine VM; by default, GCE uses strict firewall settings to avoid exposing your external IP addresses to arbitrary inbound connections. If you're using the defaults, your Hadoop cluster should be on the "default" GCE network. For this to work, you'll need to follow the adding a firewall instructions to allow incoming TCP connections from your local IP address on port 8020, and possibly on other Hadoop ports as well. It'll look something like this:

    gcloud compute firewall-rules create allow-hdfs \
        --description "Inbound HDFS." \
        --allow tcp:8020 \
        --format json \
        --source-ranges your.ip.address.here/32
    

    Note that you really want to avoid opening a 0.0.0.0/0 source range, since Hadoop isn't doing any authentication or authorization on those incoming requests. Restrict it as much as possible to only the inbound IP addresses from which you plan to dial in. You may need to open a couple of other ports as well, depending on what functionality you use when connecting to Hadoop.

    The more general recommendation is that, wherever possible, you should try to run your code on the Hadoop cluster itself; in that case, you'll use the master hostname itself as the HDFS authority rather than the external IP (see the sketch at the end of this answer):

    hdfs://<master hostname>/foo/bar
    

    That way, you can limit the port exposure to just SSH port 22, where incoming traffic is properly gated by the SSH daemon, and your code doesn't have to worry about which ports are open or about dealing with external IP addresses at all.
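
    To make that concrete, here's a minimal sketch of the same directory-creation logic when it runs on the master node itself; the class name CreateDemoFolder and the master-hostname placeholder are mine, not from the question:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateDemoFolder {
        public static void main(String[] args) throws Exception {
            // "master-hostname" stands in for the cluster's internal master hostname;
            // 8020 is the default NameNode IPC port. No external IP or firewall rule
            // is involved when this runs on the cluster itself.
            FileSystem hdfs = FileSystem.get(
                    new URI("hdfs://master-hostname:8020"), new Configuration());

            Path newFolderPath = new Path("/DemoFolder");
            if (hdfs.exists(newFolderPath)) {
                hdfs.delete(newFolderPath, true); // remove any existing directory
            }
            hdfs.mkdirs(newFolderPath);
            System.out.println("Created " + newFolderPath);
        }
    }

    Alternatively, if fs.defaultFS in your core-site.xml already points at the master's HDFS, you can drop the URI entirely and call FileSystem.get(new Configuration()) so the configuration supplies the authority for you.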