I'm using Hortonbox 3.0.1 in VirtualBox and SSH into it using PuTTY. I have some files on my local machine (Windows 10) that I want to store in the Hadoop file system.
SSH-ing into the Hortonbox instance gives me a terminal on the instance, which means none of the files on the Windows machine are visible from that terminal. Is there any way I can put files into HDFS?
I am aware of WinSCP, but that does not really serve my purpose. WinSCP would mean copying the file onto the VM, using my SSH session to store it in Hadoop, and then deleting the copy from the VM after it is stored on the datanodes. I might be wrong, but this seems like additional, redundant work, and I would always need buffer storage on the machine where Hadoop is running. For extremely large files this solution will almost certainly fail, since I would first need to store the entire file on the secondary disk before sending it through the namenode to the datanodes. Is there any way to achieve this, or is the problem I'm facing due to using a Hortonbox instance? How do organizations handle sending data from several nodes to the namenode and then to the datanodes?
First, you don't send data to the namenode for it to be placed on the datanodes. When you issue `hdfs dfs -put` commands, the only information requested from the namenode is the set of datanode locations where the file's blocks should be placed; the client then streams the data directly to those datanodes.
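For reference, a basic upload from any machine with a configured Hadoop client looks like this (the paths are placeholders):

```shell
# Create a target directory in HDFS, then upload a local file.
# The client asks the namenode for block placement, then writes
# the blocks straight to the chosen datanodes.
hdfs dfs -mkdir -p /user/myuser/data
hdfs dfs -put localfile.csv /user/myuser/data/
hdfs dfs -ls /user/myuser/data
```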
That being said, if you want to skip SSH entirely, you need to forward the namenode and datanode ports from the VM to your host, then install and configure the `hadoop fs`/`hdfs` commands on your Windows host so that you can issue them directly from CMD.
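A rough sketch of that setup, assuming a NAT-attached VM named "hortonbox" and the Hadoop 3 default ports (8020 for the namenode RPC, 9866 for datanode data transfer) — verify both the VM name and the ports against your own configuration:

```shell
# Forward the namenode and datanode ports from the guest to the host.
# "namenode"/"datanode" are just rule labels; the VM name is hypothetical.
VBoxManage modifyvm "hortonbox" --natpf1 "namenode,tcp,,8020,,8020"
VBoxManage modifyvm "hortonbox" --natpf1 "datanode,tcp,,9866,,9866"

# On the Windows host, with the Hadoop client binaries on PATH and
# fs.defaultFS in core-site.xml set to hdfs://localhost:8020, you can
# then upload directly from CMD:
hadoop fs -put C:\data\bigfile.csv /user/myuser/
```

Because the namenode reports datanode addresses that are only valid inside the VM, you will likely also need `dfs.client.use.datanode.hostname=true` in the client's `hdfs-site.xml` (plus a hosts-file entry mapping the datanode's hostname to 127.0.0.1) so the client connects through the forwarded port.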
The alternative is to use a Fuse/SFTP/NFS/Samba mount (a "shared folder" in the VirtualBox GUI) from Windows into the VM, where you could then run `put` without copying anything onto the VM's own disk first.
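With the VirtualBox shared-folder approach, that might look like the following inside the VM (the share name "winshare" is a placeholder, and Guest Additions must be installed for the `vboxsf` mount to work):

```shell
# Mount the VirtualBox shared folder exposed by the host.
sudo mkdir -p /mnt/winshare
sudo mount -t vboxsf winshare /mnt/winshare

# Upload straight from the mounted Windows folder: the file streams
# from the Windows disk through the mount to the datanodes, without
# being staged on the VM's local disk first.
hdfs dfs -put /mnt/winshare/bigfile.csv /user/myuser/
```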