A Spark Docker container is installed on an Azure VM (CentOS 7.2), and I want to access its HDFS from my local machine (Windows).
I run curl -i -v -L http://52.234.XXX.XXX:50070/webhdfs/v1/user/helloworld.txt?op=OPEN
on Windows, and it fails with the following output:
$ curl -i -v -L http://52.234.XXX.XXX:50070/webhdfs/v1/user/helloworld.txt?op=OPEN
* timeout on name lookup is not supported
* Trying 52.234.XXX.XXX...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connected to 52.234.XXX.XXX (52.234.XXX.XXX) port 50070 (#0)
> GET /webhdfs/v1/user/helloworld.txt?op=OPEN HTTP/1.1
> Host: 52.234.XXX.XXX:50070
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 307 TEMPORARY_REDIRECT
< Cache-Control: no-cache
< Expires: Fri, 16 Mar 2018 02:16:37 GMT
< Date: Fri, 16 Mar 2018 02:16:37 GMT
< Pragma: no-cache
< Expires: Fri, 16 Mar 2018 02:16:37 GMT
< Date: Fri, 16 Mar 2018 02:16:37 GMT
< Pragma: no-cache
< Location: http://sandbox:50075/webhdfs/v1/user/helloworld.txt?op=OPEN&namenoderpcaddress=sandbox:9000&offset=0
< Content-Type: application/octet-stream
< Content-Length: 0
< Server: Jetty(6.1.26)
<
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Connection #0 to host 52.234.227.186 left intact
* Issue another request to this URL: 'http://sandbox:50075/webhdfs/v1/user/helloworld.txt?op=OPEN&namenoderpcaddress=sandbox:9000&offset=0'
* timeout on name lookup is not supported
* Trying 10.122.118.83...
* TCP_NODELAY set
  0     0    0     0    0     0      0      0 --:--:--  0:00:21 --:--:--     0
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Fri, 16 Mar 2018 02:16:37 GMT
Date: Fri, 16 Mar 2018 02:16:37 GMT
Pragma: no-cache
Expires: Fri, 16 Mar 2018 02:16:37 GMT
Date: Fri, 16 Mar 2018 02:16:37 GMT
Pragma: no-cache
Location: http://sandbox:50075/webhdfs/v1/user/helloworld.txt?op=OPEN&namenoderpcaddress=sandbox:9000&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26)
* connect to 10.122.118.83 port 50075 failed: Timed out
* Failed to connect to sandbox port 50075: Timed out
* Closing connection 1
curl: (7) Failed to connect to sandbox port 50075: Timed out
The CentOS VM's public IP address is 52.234.XXX.XXX.
Is the failure caused by the unknown IP '10.122.118.83'? Is that the datanode's IP address? I have already opened these ports in the Azure VM's network settings.
I start the Docker container with:
docker run -it -p 8088:8088 -p 8042:8042 -p 9000:9000 -p 8087:8087 -p 50070:50070 -p 50010:50010 -p 50075:50075 -p 50475:50475 --name sparkdocker -h sandbox --network=host sequenceiq/spark:1.6.0 bash
The fs.defaultFS for Hadoop is 'hdfs://sandbox:9000'.
The CentOS VM and other Azure machines in the same resource group can access HDFS without problems (upload, download, read files).
ifconfig inside the Spark Docker container:
docker0 Link encap:Ethernet HWaddr 02:42:D9:2A:5D:BB
inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:53 errors:0 dropped:0 overruns:0 frame:0
TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3889 (3.7 KiB) TX bytes:6674 (6.5 KiB)
eth0 Link encap:Ethernet HWaddr 00:0D:3A:14:B5:C1
inet addr:10.0.0.7 Bcast:10.0.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:60543 errors:0 dropped:0 overruns:0 frame:0
TX packets:68081 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:22930277 (21.8 MiB) TX bytes:11271703 (10.7 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:14779 errors:0 dropped:0 overruns:0 frame:0
TX packets:14779 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:4032619 (3.8 MiB) TX bytes:4032619 (3.8 MiB)
ifconfig on the CentOS VM:
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:d9:2a:5d:bb txqueuelen 0 (Ethernet)
RX packets 53 bytes 3889 (3.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 57 bytes 6674 (6.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.0.7 netmask 255.255.255.0 broadcast 10.0.0.255
ether 00:0d:3a:14:b5:c1 txqueuelen 1000 (Ethernet)
RX packets 60750 bytes 23017881 (21.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 68320 bytes 11310643 (10.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1 (Local Loopback)
RX packets 14857 bytes 4042781 (3.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 14857 bytes 4042781 (3.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Your remote hostname cannot be sandbox, resolving to a local IP of 10.0.0.7, if you want to expose it to an external network. It needs to be an externally resolvable IP or DNS record throughout the entire request, because of the various network calls between the datanodes and the namenode and back to your external client in a remote network.
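To make the failing step concrete: WebHDFS OPEN is a two-step call, where the namenode answers with a 307 whose Location points at a datanode by hostname, and the client must then be able to reach that host. The sketch below is only a client-side illustration of that redirect, not the configuration fix described here. The helper name rewrite_redirect is hypothetical, and it assumes the datanode port (50075) is reachable on whatever host you substitute.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_redirect(location: str, reachable_host: str) -> str:
    """Swap only the hostname of a WebHDFS redirect URL.

    The port, path and query string are kept as-is. In particular the
    namenoderpcaddress parameter is left untouched, on the assumption
    that it is resolved on the datanode side, not by the client.
    """
    parts = urlsplit(location)
    port = f":{parts.port}" if parts.port is not None else ""
    netloc = f"{reachable_host}{port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# Location header copied from the curl output in the question.
location = (
    "http://sandbox:50075/webhdfs/v1/user/helloworld.txt"
    "?op=OPEN&namenoderpcaddress=sandbox:9000&offset=0"
)

# "52.234.XXX.XXX" is the masked public address from the question.
print(rewrite_redirect(location, "52.234.XXX.XXX"))
```

A client that follows redirects manually could fetch the rewritten URL instead of the original one. That only works around name resolution on one machine, so making the hostname externally resolvable remains the more general fix.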
The same applies to the YARN services, judging by your cluster being exposed on port 8088.
I believe the relevant setting in core-site.xml is fs.default.name (called fs.defaultFS in newer Hadoop versions); it needs to be something like hdfs://external.namenode.fqdn:port
And in hdfs-site.xml, set both of the properties below to true. In cloud environments, hostnames are typically static while IPs can change. Also, within the Azure network the nodes know how to reach each other, but outside the cluster the internal DNS names cannot be resolved.
dfs.client.use.datanode.hostname
dfs.datanode.use.datanode.hostname
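As a sketch of what those two hdfs-site.xml entries look like in Hadoop's usual configuration/property/name/value layout, the snippet below renders them with the Python standard library. The helper name hadoop_config is hypothetical, the property names and values are the ones given above, and the output is meant to be merged into an existing hdfs-site.xml by hand, not to replace it.

```python
import xml.etree.ElementTree as ET


def hadoop_config(properties: dict) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# Both properties set to true, as suggested in the answer.
hdfs_site = hadoop_config({
    "dfs.client.use.datanode.hostname": "true",
    "dfs.datanode.use.datanode.hostname": "true",
})
print(hdfs_site)
```

The same helper could render the core-site.xml entry, with a real externally resolvable namenode address in place of the hdfs://external.namenode.fqdn:port placeholder.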
If you're running in Azure, I would suggest just using HDInsight rather than a single-datanode sandbox.
In any case, you don't need a remote Spark instance. You can develop locally and then deploy the Spark application to a remote YARN (or Spark Standalone) cluster. You also don't need HDFS: Spark can read from Azure Blob Storage and can be run with the standalone scheduler.
Another suggestion: never open all the ports of an insecure Hadoop cluster and post its public IP on the web. Please use SSH forwarding on your end to connect securely into the Azure network.