hadoop avro sequencefile

Avro file type for images?


I am trying to...figure out this case in Hadoop.

Which file format is better, Avro or SequenceFile, for storing images in HDFS and processing them afterwards with Python?

SequenceFiles are key-value oriented, so I think Avro files would work better?


Solution

  • I use SequenceFile to store images in HDFS and it works well. Both Avro and SequenceFile are binary file formats, so both can store images efficiently. As keys in the SequenceFile I usually use the original image file names.

    SequenceFiles are used in many image processing products, such as OpenIMAJ. You can use existing tools to work with images in SequenceFiles, for example the OpenIMAJ SequenceFileTool.

    In addition, you can take a look at HipiImageBundle, a special format provided by HIPI (Hadoop Image Processing Interface). In my experience, HipiImageBundle performs better than SequenceFile, but it can be used only by HIPI.

    If you don't have a large number of files (fewer than 1M), you can try storing them as individual files, without packaging them into one big file, and use CombineFileInputFormat to speed up processing.

    I have never used Avro to store images, and I don't know of any project that uses it.
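
    To illustrate the "file name as key, image bytes as value" layout described above, here is a minimal Python sketch that packs several images into one stream of length-prefixed records. Note that this is not the real SequenceFile on-disk format (which also has a header, sync markers, and optional compression), and the names `write_record` and `read_records` are made up for this example. It only shows why a key-value container is a natural fit for many small images.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple

# Hypothetical record layout for illustration only (NOT Hadoop's format):
# 4-byte key length, 4-byte value length (both big-endian), key, value.
HEADER = struct.Struct(">II")


def write_record(out: BinaryIO, key: str, value: bytes) -> None:
    """Append one (file name, image bytes) record to the stream."""
    key_bytes = key.encode("utf-8")
    out.write(HEADER.pack(len(key_bytes), len(value)))
    out.write(key_bytes)
    out.write(value)


def read_records(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, image bytes) pairs until the stream ends."""
    while True:
        header = src.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) != HEADER.size:
            raise ValueError("truncated record header")
        key_len, value_len = HEADER.unpack(header)
        key_bytes = src.read(key_len)
        value = src.read(value_len)
        if len(key_bytes) != key_len or len(value) != value_len:
            raise ValueError("truncated record body")
        yield key_bytes.decode("utf-8"), value


if __name__ == "__main__":
    # Placeholder bytes stand in for real image contents.
    buf = io.BytesIO()
    write_record(buf, "cat.jpg", b"\xff\xd8\xff\xe0 fake jpeg data")
    write_record(buf, "dog.png", b"\x89PNG fake png data")
    buf.seek(0)
    for name, data in read_records(buf):
        print(name, len(data))
```

    Keeping the original file name as the key means each image can still be identified after packing, and the value stays an opaque byte string, so the image is decoded only when a processing step actually needs it.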