
Access from a local machine to a Spark Docker container in an Azure VM


A Spark Docker container is installed in an Azure VM (CentOS 7.2), and I want to access HDFS from my local machine (Windows).

I run curl -i -v -L http://52.234.XXX.XXX:50070/webhdfs/v1/user/helloworld.txt?op=OPEN on Windows, and the output is:

$ curl -i -v -L http://52.234.XXX.XXX:50070/webhdfs/v1/user/helloworld.txt?op=OPEN
* timeout on name lookup is not supported
*   Trying 52.234.XXX.XXX...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connected to 52.234.XXX.XXX (52.234.XXX.XXX) port 50070 (#0)
> GET /webhdfs/v1/user/helloworld.txt?op=OPEN HTTP/1.1
> Host: 52.234.XXX.XXX:50070
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 307 TEMPORARY_REDIRECT
< Cache-Control: no-cache
< Expires: Fri, 16 Mar 2018 02:16:37 GMT
< Date: Fri, 16 Mar 2018 02:16:37 GMT
< Pragma: no-cache
< Expires: Fri, 16 Mar 2018 02:16:37 GMT
< Date: Fri, 16 Mar 2018 02:16:37 GMT
< Pragma: no-cache
< Location: http://sandbox:50075/webhdfs/v1/user/helloworld.txt?op=OPEN&namenoderpcaddress=sandbox:9000&offset=0
< Content-Type: application/octet-stream
< Content-Length: 0
< Server: Jetty(6.1.26)
<
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host 52.234.XXX.XXX left intact
* Issue another request to this URL: 'http://sandbox:50075/webhdfs/v1/user/helloworld.txt?op=OPEN&namenoderpcaddress=sandbox:9000&offset=0'
* timeout on name lookup is not supported
*   Trying 10.122.118.83...
* TCP_NODELAY set
  0     0    0     0    0     0      0      0 --:--:--  0:00:21 --:--:--     0
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Fri, 16 Mar 2018 02:16:37 GMT
Date: Fri, 16 Mar 2018 02:16:37 GMT
Pragma: no-cache
Expires: Fri, 16 Mar 2018 02:16:37 GMT
Date: Fri, 16 Mar 2018 02:16:37 GMT
Pragma: no-cache
Location: http://sandbox:50075/webhdfs/v1/user/helloworld.txt?op=OPEN&namenoderpcaddress=sandbox:9000&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26)

* connect to 10.122.118.83 port 50075 failed: Timed out
* Failed to connect to sandbox port 50075: Timed out
* Closing connection 1
curl: (7) Failed to connect to sandbox port 50075: Timed out

The CentOS VM's public IP address is 52.234.XXX.XXX.

Is this caused by the unknown IP '10.122.118.83'? Is that the datanode's IP address? I have already opened these ports in the Azure VM network settings.

I start Docker with:

docker run -it -p 8088:8088 -p 8042:8042 -p 9000:9000 -p 8087:8087 -p 50070:50070 -p 50010:50010 -p 50075:50075 -p 50475:50475 --name sparkdocker -h sandbox --network=host sequenceiq/spark:1.6.0 bash

The fs.defaultFS for Hadoop is 'hdfs://sandbox:9000'. The CentOS VM and other Azure machines in the same resource group have no problem accessing HDFS (uploading, downloading, and reading files).

Spark Docker container ifconfig:

docker0   Link encap:Ethernet  HWaddr 02:42:D9:2A:5D:BB
          inet addr:172.17.0.1  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:53 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3889 (3.7 KiB)  TX bytes:6674 (6.5 KiB)

eth0      Link encap:Ethernet  HWaddr 00:0D:3A:14:B5:C1
          inet addr:10.0.0.7  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:60543 errors:0 dropped:0 overruns:0 frame:0
          TX packets:68081 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:22930277 (21.8 MiB)  TX bytes:11271703 (10.7 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:14779 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14779 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:4032619 (3.8 MiB)  TX bytes:4032619 (3.8 MiB)

CentOS VM ifconfig:

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:d9:2a:5d:bb  txqueuelen 0  (Ethernet)
        RX packets 53  bytes 3889 (3.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 57  bytes 6674 (6.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.7  netmask 255.255.255.0  broadcast 10.0.0.255
        ether 00:0d:3a:14:b5:c1  txqueuelen 1000  (Ethernet)
        RX packets 60750  bytes 23017881 (21.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 68320  bytes 11310643 (10.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1  (Local Loopback)
        RX packets 14857  bytes 4042781 (3.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14857  bytes 4042781 (3.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Solution

  • Your remote hostname cannot be sandbox, backed by a local IP of 10.0.0.7, if you want to expose it to an external network. It needs to be an externally resolvable IP or DNS record throughout the entire request, because of the various network calls between the datanodes and the namenode and back to your external client on a remote network.
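
    One way to confirm this from the Windows side is curl's --resolve option, which pins the redirect hostname to the VM's public IP for a single request. This is a diagnostic sketch only; the IP stays masked as in the question:

    ```shell
    # Hypothetical diagnostic: force 'sandbox' to resolve to the VM's public
    # IP for this request only, so the second hop of the WebHDFS redirect
    # (sandbox:50075) reaches the Azure VM instead of an unknown address.
    curl -L --resolve sandbox:50075:52.234.XXX.XXX \
      "http://52.234.XXX.XXX:50070/webhdfs/v1/user/helloworld.txt?op=OPEN"
    ```

    If this succeeds, it confirms that only name resolution of the datanode hostname is failing, not connectivity to the VM itself.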

    The same applies to the YARN services; check your cluster on port 8088.

    I believe it's a setting in core-site.xml; the value needs to be something like hdfs://external.namenode.fqdn:port

    fs.default.name
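
    A minimal sketch of what that property could look like in core-site.xml, where namenode.example.com is a placeholder for an externally resolvable name (not a value from the question):

    ```xml
    <!-- core-site.xml: hypothetical externally resolvable default filesystem.
         'namenode.example.com' and the port are placeholders. -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.com:9000</value>
    </property>
    ```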
    

    And in hdfs-site.xml, set both of these to true, because in cloud environments your hostnames are typically static while IPs can change. Also, within the Azure network the nodes know how to reach each other, but outside the cluster the internal DNS names cannot be resolved.

    dfs.client.use.datanode.hostname 
    dfs.datanode.use.datanode.hostname
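
    As a sketch, the two properties above would look like this in hdfs-site.xml:

    ```xml
    <!-- hdfs-site.xml: make both clients and datanodes use hostnames
         rather than (possibly internal) IP addresses. -->
    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.datanode.use.datanode.hostname</name>
      <value>true</value>
    </property>
    ```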
    

    If you're running in Azure, I might suggest just using HDInsight rather than a single-datanode sandbox.

    In any case, you don't need a remote Spark instance. You can develop locally and deploy that Spark application to a remote YARN (or Spark Standalone) cluster. You also don't need HDFS: Spark can read from Azure Blob storage and be run under the standalone scheduler.
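
    For example, a locally developed job can be pointed at Blob storage roughly like this. This is a hedged sketch: the master URL, storage account, container, key, and hadoop-azure version are all placeholders, not values from the question:

    ```shell
    # Hypothetical sketch: submit a local Spark app that reads from Azure Blob
    # storage (wasbs://) instead of HDFS. All names here are placeholders.
    spark-submit \
      --master spark://standalone-master:7077 \
      --packages org.apache.hadoop:hadoop-azure:2.7.3 \
      --conf spark.hadoop.fs.azure.account.key.MYACCOUNT.blob.core.windows.net=MY_KEY \
      my_app.py "wasbs://mycontainer@MYACCOUNT.blob.core.windows.net/user/helloworld.txt"
    ```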

    Another suggestion: never open all the ports of an insecure Hadoop cluster and post the public IP on the web. Please use SSH forwarding on your end to connect securely into the Azure network.
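
    A minimal sketch of such a tunnel from the Windows machine, assuming azureuser is a placeholder for the VM login:

    ```shell
    # Hypothetical sketch: forward the WebHDFS ports over SSH so the cluster
    # is reached via localhost instead of the public IP.
    ssh -N -L 50070:localhost:50070 -L 50075:localhost:50075 azureuser@52.234.XXX.XXX

    # In another terminal, query WebHDFS through the tunnel:
    curl -L "http://localhost:50070/webhdfs/v1/user/helloworld.txt?op=OPEN"
    ```

    Note that the 307 redirect will still point at sandbox:50075, so a local hosts-file entry mapping sandbox to 127.0.0.1 may also be needed for the second hop to use the forwarded port.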