Tags: hadoop, mapreduce, google-compute-engine, google-hadoop

Job tracking URL in Google Compute Engine not working


I am using Google Compute Engine to run MapReduce jobs on Hadoop (pretty much all default configs). While running a job I get a tracking URL of the form http://PROJECT_NAME:8088/proxy/application_X_Y/, but it fails to open. Did I forget to configure something?


Solution

  • To elaborate on the option Amal mentioned in the other answer, using the "external IP address" of your Google Compute Engine VM: you can obtain the external IP address by running gcloud compute instances describe --zone <your zone> <your master hostname> and looking for natIP in the output.
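
    For example, assuming a hypothetical master VM named hadoop-m in zone us-central1-a (substitute your own names), something along these lines should print just the external IP; the --format expression is one way to pull natIP out of the instance description:

    # Print only the external (NAT) IP of the master instance (instance name and zone are placeholders).
    gcloud compute instances describe hadoop-m \
        --zone us-central1-a \
        --format='value(networkInterfaces[0].accessConfigs[0].natIP)'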

    To open port 8088, you'll have to set up a firewall rule opening that port, likely on your default Google Compute Engine network. You'll want to specify a your.ip.address.here/32 address in the --source-ranges to restrict incoming traffic to just your local machine dialing into your VM; otherwise, anyone within the IP source ranges would be able to access your Hadoop pages.
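
    As a sketch, assuming your cluster is on the default network and your workstation's public IP is 203.0.113.5 (a placeholder, as is the rule name), a rule like this opens the YARN ResourceManager port to that one address only:

    # Allow TCP 8088 (YARN ResourceManager web UI) from a single source address.
    gcloud compute firewall-rules create allow-yarn-ui \
        --network default \
        --allow tcp:8088 \
        --source-ranges 203.0.113.5/32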

    If you used bdutil to turn up your cluster, there's an alternative that is much easier and more secure; simply run

    bdutil <your flags used in deployment, like -e hadoop2, --prefix, etc.> socksproxy
    

    to open an SSH session with dynamic port forwarding that acts as a SOCKS5 proxy your browser can point at. If you're running on Linux or Mac and have Chrome or Firefox installed, bdutil should also print out a copy/paste command for starting a fresh, isolated browser pre-configured to use the SOCKS proxy so that you can click through all the useful links.
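
    The printed browser command is roughly of this shape (the port, profile directory, and binary name here are placeholders; use whatever bdutil actually prints for your deployment):

    # Launch an isolated Chrome profile that sends its traffic through the local SOCKS5 proxy.
    google-chrome \
        --user-data-dir=/tmp/hadoop-proxy-profile \
        --proxy-server='socks5://localhost:1080'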

    If bdutil didn't print out a browser command, or you didn't use bdutil, you can also set up and configure your SSH SOCKS proxy using these instructions. An SSH-based SOCKS proxy is more secure than opening up firewall ports, and it also allows the Hadoop page links to work (otherwise you have to keep manually replacing the hostnames with the external IP addresses).
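
    A minimal sketch of doing that by hand, again assuming a master named hadoop-m in zone us-central1-a and local port 1080 (all placeholders):

    # Open an SSH tunnel with dynamic port forwarding (-D); -N keeps it open without running a remote shell.
    gcloud compute ssh hadoop-m --zone us-central1-a \
        --ssh-flag='-D 1080' --ssh-flag='-N'

    Then configure your browser to use localhost:1080 as a SOCKS5 proxy and to resolve DNS through the proxy, so the internal hostnames in the Hadoop links resolve on the cluster side of the tunnel.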