Store documents (.pdf, .doc and .txt files) in MaprDB

I need to store documents such as .pdf, .doc and .txt files in MaprDB. I saw an example in HBase where files are stored as binary and retrieved as files in Hue, but I am not sure how that could be implemented. Any idea how a document can be stored in MaprDB?

Solution

First of all, I'm not familiar with MaprDB, as I use Cloudera. But I have experience storing many types of objects in HBase as byte arrays, as described below.

The most primitive way of storing anything in HBase, or in any other database, is as a byte array. See my answer.

You can do this with the Apache Commons Lang API, as shown below. This is probably the best option, since it applies to any serializable object, including image, audio and video data.

Please test this method with an object built from one of your files. SerializationUtils.serialize returns a byte array, which you can then insert.

import org.apache.commons.lang.SerializationUtils;

/**
 * Serializes an object to bytes and deserializes it again.
 */
public void testSerializeAndDeserialize() throws Exception {

    // Serialize: the argument can be any Serializable object. This String
    // stands in for the object built from your .pdf, .doc or .txt file.
    byte[] bytes = SerializationUtils.serialize("your object here, built from a .pdf, .doc or .txt file");

    // Deserialize the same bytes and check that you get the original object back.
    String restored = (String) SerializationUtils.deserialize(bytes);
}

Note: the Apache Commons Lang jar is normally already available on a Hadoop cluster, so it is not an external dependency.
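To make it clearer what ends up in the byte array, here is a minimal sketch that uses only the JDK. As far as I know, SerializationUtils.serialize is a thin wrapper around ObjectOutputStream, so this should behave the same way. The class name JdkSerializationSketch and its methods are hypothetical, and the sample String only stands in for real file contents.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class JdkSerializationSketch {

    // Turns any Serializable object into bytes using standard Java serialization.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Rebuilds the object from bytes produced by serialize().
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the raw contents of a .pdf, .doc or .txt file.
        byte[] fileContent = "stand-in for document bytes".getBytes(StandardCharsets.UTF_8);

        byte[] stored = serialize(fileContent);          // what you would put in the database
        byte[] restored = (byte[]) deserialize(stored);  // what you get back after reading it

        System.out.println("round trip equal: " + Arrays.equals(fileContent, restored));
        System.out.println("raw=" + fileContent.length + " bytes, serialized=" + stored.length + " bytes");
    }
}
```

The serialized form carries a small Java serialization header, so the stored bytes are slightly longer than the raw file bytes. That means they have to be deserialized again before being written back out as a file.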

Another example:

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.commons.lang.SerializationUtils;

public class SerializationUtilsTrial {
  public static void main(String[] args) {
    try {
      // File to serialize object to
      String fileName = "testSerialization.ser";

      // New file output stream for the file
      FileOutputStream fos = new FileOutputStream(fileName);

      // Serialize String
      SerializationUtils.serialize("SERIALIZE THIS", fos);
      fos.close();

      // Open FileInputStream to the file
      FileInputStream fis = new FileInputStream(fileName);

      // Deserialize and cast into String
      String ser = (String) SerializationUtils.deserialize(fis);
      System.out.println(ser);
      fis.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

If for any reason you don't want to use the SerializationUtils class provided by Apache Commons Lang, the PDF serialize and deserialize example below may help your understanding. It is lengthier, though; using SerializationUtils shortens the code.

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PdfSerializeAndDeserExample {

    public static void main(String[] args) throws FileNotFoundException, IOException {
        File file = new File("someFile.pdf");

        FileInputStream fis = new FileInputStream(file);
        //System.out.println(file.exists() + "!!");
        //InputStream in = resource.openStream();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            for (int readNum; (readNum = fis.read(buf)) != -1;) {
                // Write readNum bytes from buf, starting at offset 0, to the byte array output stream.
                bos.write(buf, 0, readNum);
                System.out.println("read " + readNum + " bytes,");
            }
        } catch (IOException ex) {
            Logger.getLogger(PdfSerializeAndDeserExample.class.getName()).log(Level.SEVERE, null, ex);
        } finally {
            // Release the file handle whether or not the read succeeded.
            fis.close();
        }
        byte[] bytes = bos.toByteArray();

The code above gives you a byte array, which you can use to prepare a put request and upload to a database, i.e. HBase or any other database.

Once it is persisted, you can read it back using an HBase `get` or `scan`. You then have your PDF bytes and can use the code below to recreate the same file, i.e. someFile.pdf in this case.

        File someFile = new File("someFile.pdf");
        FileOutputStream fos = new FileOutputStream(someFile);
        fos.write(bytes);
        fos.flush();
        fos.close();
    }
}
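If you are on Java 7 or later, the same file-to-bytes and bytes-to-file steps can probably be shortened with java.nio.file.Files. This is only a sketch: the class name FileBytesSketch, its two methods and the output name someFile-copy.pdf are hypothetical, and the database put/get in between is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, for illustration only.
class FileBytesSketch {

    // Reads the whole document into memory as raw bytes.
    static byte[] readDocument(Path source) throws IOException {
        return Files.readAllBytes(source);
    }

    // Writes the bytes back out, creating or overwriting the target file.
    static void writeDocument(Path target, byte[] bytes) throws IOException {
        Files.write(target, bytes);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to someFile.pdf, as in the example above.
        Path source = Paths.get(args.length > 0 ? args[0] : "someFile.pdf");
        byte[] bytes = readDocument(source);

        // Here the bytes would be stored in the database and read back later.

        writeDocument(Paths.get("someFile-copy.pdf"), bytes);
        System.out.println("copied " + bytes.length + " bytes");
    }
}
```

Both approaches load the whole file into memory, so this suits documents of moderate size.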

EDIT: Since you asked for HBase examples, I'm adding the method below.

Here yourcolumnasBytearray is your document file, for instance a PDF, converted to a byte array (using SerializationUtils.serialize) as in the examples above.

/**
 * Put (or insert) a row.
 */
@Override
public void addRecord(final String tableName, final String rowKey, final String family, final String qualifier,
                final byte[] yourcolumnasBytearray) throws Exception {
    try {
        final HTableInterface table = HBaseConnection.getHTable(getTable(tableName));
        final Put put = new Put(Bytes.toBytes(rowKey));
        // Store the document bytes in the given column family and qualifier.
        put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), yourcolumnasBytearray);
        table.put(put);
        LOG.info("INSERT record " + rowKey + " to table " + tableName + " OK.");
    } catch (final IOException e) {
        e.printStackTrace();
    }
}
