Tags: hadoop, hdfs, docker, coreos, distributed-filesystem

CoreOS & HDFS - Running a distributed file system in Linux Containers/Docker


I need some sort of distributed file system running on a CoreOS cluster.

As such I'd like to run HDFS on CoreOS nodes. Is this possible?

I can see two options:

  1. Expand CoreOS - Install HDFS directly onto CoreOS - not ideal as it breaks the whole concept of CoreOS's containerisation and would mean installing a lot of additional components
  2. Somehow run HDFS in a Docker container on CoreOS and set affinities

Option 2 seems like the best approach, however, there are some potential blockers:

  • How do I reliably expose the physical disks to the Docker container running HDFS?
  • How do you scale container affinities?
  • How does this work with the NameNodes, etc.?

Cheers.


Solution

  • I'll try to provide two possibilities. I haven't tried either of these, so they are mostly suggestions, but they could point you down the right path.

    The first, if you want to run HDFS and it requires device access on the host, would be to run the HDFS daemons in a privileged container that has access to the required host devices (the disks directly). See https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration for information on the --privileged and --device flags.
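
    As a rough, untested sketch of that first idea (the image name, device path and mount points below are placeholders, not a known-good setup), the DataNode could be started with the disk passed through explicitly, or with full privileges:

        # Hypothetical HDFS DataNode with direct access to a host disk.
        # "hdfs-datanode" is a placeholder image name; adjust devices and paths.
        docker run -d --name datanode1 \
          --device=/dev/sdb \
          -v /mnt/hdfs-data:/hadoop/dfs/data \
          hdfs-datanode

        # Or, more bluntly, grant the container access to all host devices:
        docker run -d --privileged --name datanode1 \
          -v /mnt/hdfs-data:/hadoop/dfs/data \
          hdfs-datanode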

    In theory, you could pass the devices to the container that handles the disk access, and then use something like --link so the containers can talk to each other. The NameNode would store its metadata on the host using a volume (passed with -v). Though, from the little reading I have done about the NameNode, it seems there isn't a good solution for high availability yet anyway, and it is a single point of failure.
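
    To make that concrete (again just a sketch with placeholder image names and paths), the NameNode could keep its metadata on a host volume, and the DataNode could be linked to it:

        # Hypothetical NameNode storing its metadata on the host via a volume.
        docker run -d --name namenode \
          -v /mnt/hdfs-name:/hadoop/dfs/name \
          hdfs-namenode

        # A DataNode linked to the NameNode and given a host disk.
        docker run -d --name datanode1 \
          --link namenode:namenode \
          --device=/dev/sdb \
          hdfs-datanode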

    The second option to explore, if you are looking for a clustered file system and not HDFS in particular, would be to check out the recent Ceph FS support added to the kernel in CoreOS 471.1.0: https://coreos.com/releases/#471.1.0. You might then be able to use the same privileged-container approach to access the host disks and build a Ceph FS cluster. Then you might have a 'data only' container with the Ceph tools installed that mounts a directory on the Ceph FS cluster and exposes it as a volume for other containers to use.
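
    A very rough sketch of that 'data only' idea follows; it is untested, the image name, monitor address and key file are placeholders, and whether a mount made inside one container is actually visible to others through the volume is exactly the sort of thing that would need verifying:

        # A privileged container with the Ceph tools mounts Ceph FS with the kernel
        # client and declares /ceph as a volume ("ceph-tools" is a placeholder image).
        docker run -d --privileged --name cephfs-data \
          -v /ceph \
          ceph-tools \
          sh -c 'mount -t ceph 10.0.0.1:6789:/ /ceph -o name=admin,secretfile=/etc/ceph/secret && sleep infinity'

        # Other containers could then try to reuse that volume:
        docker run -d --volumes-from cephfs-data my-app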

    Both of these are only ideas, though, and I haven't used HDFS or Ceph personally (although I am keeping an eye on Ceph and would like to try something like this as a proof of concept soon).