I have parsed the whole English Wikipedia and saved each parsed article in a separate protocol buffer file. Each file has a unique id (wikiid). I now have 4.7 million parsed articles with a total size of 180 GB. I know ext4 can handle this many files, but is it good practice, or should I use a database? I will not need to update the data frequently.
Keep it as files - a database is relatively more expensive to scale and maintain. You may want to be careful in how you name/store them though: instead of one directory holding all 4.7M files, use a directory structure that goes, say, 4 levels deep. Preprocess the 4.7M files to lay them out in that structure. If a file's id is D1D2D3D4fewmorechars, store it as /D1/D2/D3/D4/D1D2D3D4fewmorechars.txt - a sketch of this is below.
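A minimal sketch of that preprocessing step in Python, assuming each file is currently named <wikiid>.txt in one flat directory and the first four characters of the id become the directory levels (the function names and the 4-level split are just illustrative):

```python
import os
import shutil

def sharded_path(root, wikiid, levels=4):
    # One directory level per leading character of the id,
    # e.g. D1D2D3D4fewmorechars -> root/D1/D2/D3/D4/D1D2D3D4fewmorechars.txt
    parts = list(wikiid[:levels])
    return os.path.join(root, *parts, wikiid + ".txt")

def move_into_shards(flat_dir, root, levels=4):
    # Move every file from the flat directory into the sharded layout.
    for name in os.listdir(flat_dir):
        if not name.endswith(".txt"):
            continue
        wikiid = name[:-len(".txt")]
        dest = sharded_path(root, wikiid, levels)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(os.path.join(flat_dir, name), dest)
```

If the ids are numeric or their leading characters are unevenly distributed, hashing the id first (e.g. taking the first few hex digits of its MD5) gives a more even fan-out - with a 4-hex-digit split that is 16^4 = 65536 leaf directories, so roughly 4.7M / 65536 ≈ 72 files per leaf.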
The other option is to use a file system such as XFS or ext3/4 that uses directory indexing techniques such as hashed directories (dir_index/HTree on ext3/4). Check this link - https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory