Search code examples
filefilesystemsperformancefilesize

Lots of small files or a couple huge ones?


In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as much as a few million) or a couple (ten or so) huge (several gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).

I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.

Thanks in advance, and let me know if I need to be more specific.


EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).


Solution

  • There are a lot of assumptions here but, for all intents and purposes, searching through a large file will much be quicker than searching through a bunch of small files.

    Let's say you are looking for a string of text contained in a text file. Searching a 1TB file will be much faster than opening 1,000,000 MB files and searching through those.

    Each file-open operation takes time. A large file only has to be opened once.

    And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.

    ...Again, these are generalizations without knowing more about your specific application.