Tags: distributed-computing, distributed-filesystem, gfs

Why do small files create hot spots in the Google File System?


I don't understand this passage from the Google File System paper:

A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file.

What difference does a small file make? Aren't large files being accessed by many clients equally likely to cause problems?

Here is what I've thought and read so far:

  • I assume (correct me if I'm wrong) that the chunks of a large file are stored on different chunkservers, thereby distributing load. In that scenario, say 1000 clients each read 1/100th of the file from each of 100 chunkservers; every chunkserver still ends up handling 1000 requests. Isn't that the same as 1000 clients accessing a single small file? Either way a server handles 1000 requests, whether for one small file or for pieces of a larger one. (The sketch after this list tallies exactly this.)
  • I read a little about sparse files. According to the paper, a small file fills up just one chunk or a few chunks, so to my understanding small files don't have to be reconstructed from pieces scattered across servers, and I've ruled this out as the cause of the hot spots.
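
To make the arithmetic in my first point concrete, here is a minimal sketch in Python (the round-robin placement and the numbers are just assumptions for illustration, not anything the paper specifies) tallying total requests per chunkserver when every client reads every chunk:

    def requests_per_chunkserver(num_clients, num_chunks, num_chunkservers):
        """Total requests each chunkserver gets if every client reads every chunk
        and chunks are placed round-robin across chunkservers (assumed placement)."""
        counts = [0] * num_chunkservers
        for chunk in range(num_chunks):
            server = chunk % num_chunkservers   # assumed round-robin placement
            counts[server] += num_clients       # every client fetches this chunk once
        return counts

    # Large file: 100 chunks spread over 100 chunkservers.
    large = requests_per_chunkserver(num_clients=1000, num_chunks=100, num_chunkservers=100)
    # Small file: a single chunk on a single chunkserver.
    small = requests_per_chunkserver(num_clients=1000, num_chunks=1, num_chunkservers=1)
    print(max(large), max(small))   # 1000 1000 -- the totals come out the same

Both totals come out to 1000, which is why I don't see what difference the file size makes.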

Solution

  • Some of the subsequent text can help clarify:

    However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

    If 1000 clients want to read a small file at the same time, the N chunkservers holding its only chunk will each receive roughly 1000 / N simultaneous requests. That sudden burst of load is what's meant by a hot spot. (The first sketch after this answer works through the numbers.)

    A large file isn't going to be read all at once by a given client (after all, it is large). Instead, the client will load some portion of the file, work on it, then move on to the next portion.

    In a sharding (MapReduce, Hadoop) scenario, workers may not even read the same chunks at all; each of N clients reads its own 1/N of the file's chunks, distinct from the others'. (The second sketch after this answer models this.)

    Even in a non-sharding scenario, in practice clients will not be completely synchronized. They may all end up reading the entire file, but with access patterns random enough that statistically there is no hot spot. And if they do read it sequentially, they will drift out of sync because of differences in workload (unless you're purposefully synchronizing the clients... but don't do that).

    So even with lots of clients, larger files see less hot-spotting because of the nature of the work large files entail. It's not guaranteed, which I think is the point of your question, but in practice distributed clients won't work in tandem on every chunk of a multi-chunk file.
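
To put numbers on the 1000 / N figure and on the paper's fix (a higher replication factor plus staggered start times), here is a rough back-of-the-envelope sketch in Python. The function name and the specific replica/stagger counts are made up for illustration; the only figure taken from the paper is the default of three replicas per chunk.

    def peak_simultaneous_requests(num_clients, replicas, stagger_groups=1):
        """Worst-case simultaneous requests per chunkserver for a single-chunk file.

        Each client reads from one of the `replicas` chunkservers holding the chunk;
        `stagger_groups` is how many waves the batch-queue system splits the start
        times into (1 = everyone starts at once)."""
        clients_per_wave = num_clients / stagger_groups
        return clients_per_wave / replicas

    # Default GFS replication of 3, all 1000 clients starting at once: ~333 each.
    print(peak_simultaneous_requests(1000, replicas=3))
    # The paper's fix, with illustrative numbers: more replicas for the executable
    # plus five staggered start waves brings the peak down to ~20 per chunkserver.
    print(peak_simultaneous_requests(1000, replicas=10, stagger_groups=5))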
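
And to illustrate why large multi-chunk files rarely hot-spot in practice, here is a second toy model in Python contrasting sharded reads with unsynchronized full scans. The chunk placement, cluster size, and client count are all invented for the example:

    import random

    NUM_CHUNKS, NUM_SERVERS, NUM_CLIENTS = 1000, 200, 1000
    placement = [c % NUM_SERVERS for c in range(NUM_CHUNKS)]  # assumed round-robin placement

    def peak_load(chunks_read_per_client):
        """Highest number of requests any single chunkserver sees."""
        counts = [0] * NUM_SERVERS
        for chunks in chunks_read_per_client:
            for c in chunks:
                counts[placement[c]] += 1
        return max(counts)

    # Sharded (MapReduce-style): client i reads only its own 1/N slice of the chunks.
    sharded = [range(i, NUM_CHUNKS, NUM_CLIENTS) for i in range(NUM_CLIENTS)]
    # Unsynchronized sequential readers: at any given instant each client happens to
    # be somewhere different in the file, so sample one random chunk per client.
    one_instant = [[random.randrange(NUM_CHUNKS)] for _ in range(NUM_CLIENTS)]

    print("sharded peak:", peak_load(sharded))      # 5 requests on the busiest server
    print("instant peak:", peak_load(one_instant))  # typically on the order of 10

    # Compare either figure with the ~333 simultaneous requests a single-chunk
    # file puts on each of its 3 replicas: that concentration is the hot spot.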