How to store an image file in Impala


I have an image file (.jpg or .jpeg) on my local system and would like to store it in an Impala database. How can I do that?


Solution

  • I think there are a few ways to solve this, depending on your exact requirements.

    1. Using Hive

    Hive allows you to store binary data in a Hive table. Hive is similar to Impala: generally slower, but with more functionality. You can declare a column with the BINARY data type in the table definition and load images with LOAD DATA. Something like this might work (not tested).

    CREATE TABLE images (picture BINARY);
    LOAD DATA LOCAL INPATH 'x/y/image.jpg' INTO TABLE images;
    

    2. Using Impala

    Impala does not allow binary data. What you can do instead is serialize and deserialize: convert the image to a string format that still contains all the information needed to transform it back, and store that string. When you need to retrieve the image, you deserialize it, meaning you convert the string back to the original binary format.

    In Python, for example, this could look like:

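    import base64

    def img_to_string(image_path):
        # Read the raw bytes and return them as a Base64-encoded text string.
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("ascii")

    def string_to_img(image_string, output_path="new_image.jpg"):
        # Decode the Base64 string and write the original bytes back to disk.
        with open(output_path, "wb") as image_file:
            image_file.write(base64.b64decode(image_string))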
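
    A quick way to sanity-check the serialize-deserialize idea is a round trip on raw bytes. The sketch below is illustrative only: encode_image_bytes, decode_image_string and to_tsv_row are hypothetical helper names, and the tab-separated row layout (image name, then Base64 text) is an assumption about how you might stage the data for a text table with two STRING columns. It also shows the cost of the approach: Base64 emits 4 characters for every 3 input bytes, so the stored string is about one third larger than the image.

```python
import base64


def encode_image_bytes(raw_bytes):
    """Return the Base64 text form of raw image bytes."""
    return base64.b64encode(raw_bytes).decode("ascii")


def decode_image_string(image_string):
    """Return the original bytes from their Base64 text form."""
    return base64.b64decode(image_string)


def to_tsv_row(image_name, raw_bytes):
    """Build one tab-separated line: image name, then Base64 payload.

    Assumed layout for a text table with two STRING columns.
    Standard Base64 output contains no tabs or newlines, so the
    payload itself cannot break the row (the name still must not
    contain a tab or newline).
    """
    return image_name + "\t" + encode_image_bytes(raw_bytes) + "\n"


if __name__ == "__main__":
    # Stand-in bytes; a real caller would use open(path, "rb").read().
    sample = bytes(range(256)) * 3
    encoded = encode_image_bytes(sample)
    assert decode_image_string(encoded) == sample  # lossless round trip
    print(len(sample), "bytes ->", len(encoded), "characters")
```

    Keeping the helpers byte-based (rather than path-based) makes them easy to test without touching the file system.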
    

    3. Using HDFS only

    Often the image data itself does not need to live in a database. You could simply place the images in HDFS and, if necessary, keep each image's HDFS file path in a database table. You can then retrieve the path with an Impala query. To fetch a file from a remote machine, first run the following on a cluster host:

    ssh <user>@<host> "hadoop fs -get <hdfs_path> <os_path>"
    Then use scp to copy the file from <host> to your local machine.
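
    The HDFS-only workflow can be scripted as well. The following is a minimal sketch, assuming the hadoop, ssh and scp command-line tools are available; the helper names and the example user, host and paths are hypothetical. Each helper only builds an argument list, so you can inspect the command before passing it to run().

```python
import shlex
import subprocess


def hdfs_put_command(local_path, hdfs_path):
    """Command that uploads a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def hdfs_get_command(hdfs_path, os_path):
    """Command that downloads an HDFS file to the local file system."""
    return ["hadoop", "fs", "-get", hdfs_path, os_path]


def remote_get_command(user, host, hdfs_path, remote_os_path):
    """Wrap the HDFS get in ssh so that it runs on a cluster host."""
    remote = " ".join(
        shlex.quote(part) for part in hdfs_get_command(hdfs_path, remote_os_path)
    )
    return ["ssh", f"{user}@{host}", remote]


def scp_command(user, host, remote_os_path, local_path):
    """Command that copies the fetched file from the host to this machine."""
    return ["scp", f"{user}@{host}:{remote_os_path}", local_path]


def run(command):
    """Execute a command list and raise if it exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical values; the HDFS path would come from your Impala query.
    steps = [
        remote_get_command("alice", "edge1", "/data/img/a.jpg", "/tmp/a.jpg"),
        scp_command("alice", "edge1", "/tmp/a.jpg", "./a.jpg"),
    ]
    for step in steps:
        print(step)  # replace print with run(step) to actually execute
```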