Tags: linux, filesystems, io, b-tree

Block-based storage


I would like to store a couple of entries in a file (optimized for reading), and a B+ tree seems to be a good data structure for that. It offers an O(log(n)/log(b)) access time, where b is the number of entries in one block.
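
A rough back-of-the-envelope sketch of what that formula works out to in practice (the block size, entry size, and entry count below are assumed example values, not numbers from this question):

    /* Sketch: estimate the height of a B+ tree for assumed example values.
     * None of these numbers come from the original question. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double block_size = 4096.0; /* assumed 4 kB blocks                  */
        const double entry_size = 24.0;   /* assumed 16-byte key + 8-byte pointer */
        const double n = 1e6;             /* assumed one million entries          */

        double b = block_size / entry_size;    /* entries per block, roughly 170  */
        double height = ceil(log(n) / log(b)); /* O(log(n)/log(b)) levels         */

        printf("b = %.0f entries per block, about %.0f levels\n", b, height);
        return 0;
    }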

There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:

  1. Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiple of the device's block size?
  2. Is it right that I should only use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n and p being integers) to ensure that I always read full blocks?
  3. Is it better to read() BLOCK_SIZE bytes into an array, or to mmap() them instead? Or is there only a difference if I mmap many blocks and access only a few of them?
  4. Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes by adding padding bytes between the data as well?

Many thanks,
Christoph


Solution

    1. In general, file systems create new files at the beginning of a new block, because that is how the underlying device works. Hard disks are block devices and cannot handle anything smaller than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
      One exception to this rule that comes to mind is ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation. (A small sketch showing how to query the preferred I/O block size follows after this list.)

    2. It does not really matter, because the operating system will read data into the page cache in units of full pages (normally 4kB) anyway. Reading one byte will transfer 4kB and return one byte; reading another byte will be served from the page cache (if it is the same page or one within the readahead range). A minimal block-aligned pread sketch follows after this list.

    3. read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and is usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, not otherwise.
      The same is true under Windows, with the exception that memory-mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything. (A short mmap sketch follows after this list.)

    4. Generally one tries to keep data compact, because less data means fewer pages, and fewer pages mean a higher likelihood that they are in the page cache and within the readahead range. Therefore I would not add padding unless it is necessary for other reasons (alignment). A sketch of a block-sized node layout that avoids manual padding follows below.
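
As a small addition to point 1, the preferred I/O block size for a given file can be queried with stat(); the file name below is just a placeholder:

    /* Sketch: query the filesystem's preferred I/O block size for a file.
     * "index.db" is a placeholder name, not from the original post. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void) {
        struct stat st;
        if (stat("index.db", &st) == 0) {
            /* st_blksize is the block size the kernel prefers for I/O on this file */
            printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);
        } else {
            perror("stat");
        }
        return 0;
    }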
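
Regarding question 2 and point 2, reading one node as a block-sized, block-aligned pread() looks roughly like this (BLOCK_SIZE and the file name are assumptions for illustration); either way, the kernel fetches whole pages into the page cache:

    /* Sketch of the pattern from question 2: read block p with one aligned pread(). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096   /* assumed block size */

    /* Read block number p from fd into buf (which must hold BLOCK_SIZE bytes). */
    static int read_block(int fd, void *buf, off_t p) {
        ssize_t n = pread(fd, buf, BLOCK_SIZE, p * BLOCK_SIZE);
        return n == BLOCK_SIZE ? 0 : -1;
    }

    int main(void) {
        int fd = open("index.db", O_RDONLY);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        unsigned char buf[BLOCK_SIZE];
        if (read_block(fd, buf, 0) == 0)       /* block 0 as the root node, say */
            printf("read root block\n");

        close(fd);
        return 0;
    }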
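
For point 3, a minimal mmap() sketch (file name and error handling kept to the bare minimum); note that any blocking happens on first access to a page, not at the mmap() call itself:

    /* Sketch: map the whole index file read-only and touch one byte. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("index.db", O_RDONLY);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Read-only shared mapping: pages are faulted in from the page cache
         * on first access instead of being copied as read() would do. */
        unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned char first_byte = base[0];    /* may block here, not at mmap() */
        printf("first byte: %u\n", (unsigned)first_byte);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }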
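
And for point 4, one way to keep nodes from spanning blocks without sprinkling padding bytes by hand is to check the node layout against the block size at compile time. ORDER and the field types below are assumed example values:

    /* Sketch: a fixed-size inner node checked against the block size (C11). */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096
    #define ORDER 200                      /* assumed max keys per inner node */

    struct inner_node {
        uint16_t nkeys;                    /* number of keys currently in use */
        uint64_t keys[ORDER];              /* fixed-size keys                 */
        uint64_t children[ORDER + 1];      /* block numbers of child nodes    */
    };

    /* If this fails, reduce ORDER instead of letting a node span two blocks. */
    static_assert(sizeof(struct inner_node) <= BLOCK_SIZE,
                  "inner node must fit into one block");

    int main(void) {
        printf("node size: %zu of %d bytes\n",
               sizeof(struct inner_node), BLOCK_SIZE);
        return 0;
    }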