java arrays nio memory-mapped-files bytebuffer

How to read a memory-mapped file that is in a particular format?


I am working with a memory-mapped file in Java. The data for each user ID is stored in Avro binary-encoded format inside the memory-mapped file.

The memory-mapped file consists of two main parts:

- a header, which serves as an index into the full file's contents, answering questions about the file and giving the offset into the file for each user's data.
- a body, which follows the header and holds the data for each user at the given offset.

Header

version                     4 bytes
last_modified_date          8 bytes
users                       4 bytes
shards                      4 bytes
the shards                  N * 4 bytes
num_hash_index              4 bytes
num_chain_slots             4 bytes
user offset/size index      num_hash_index * num_chain_slots * (8 bytes + 8 bytes + 4 bytes)
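
For orientation, the total header size implied by this layout can be worked out with a little arithmetic. The following is only a sketch: it assumes the fields are packed in the listed order with no padding, and the class and method names (`HeaderLayout`, `headerSize`) are made up for illustration.

```java
class HeaderLayout {
    // Bytes per index entry: 8 + 8 + 4, as listed for the user offset/size index.
    static final int INDEX_ENTRY_BYTES = 8 + 8 + 4;

    // Offset of the shard list: version (4) + last_modified_date (8) + users (4) + shards (4).
    static final int SHARD_LIST_OFFSET = 4 + 8 + 4 + 4;

    /** Total header size in bytes, assuming no padding between fields. */
    static long headerSize(int numShards, int numHashIndex, int numChainSlots) {
        // Fixed fields, the shard list, then num_hash_index (4) and num_chain_slots (4).
        long fixed = SHARD_LIST_OFFSET + 4L * numShards + 4 + 4;
        // Use long arithmetic so large index tables do not overflow an int.
        long index = (long) numHashIndex * numChainSlots * INDEX_ENTRY_BYTES;
        return fixed + index;
    }

    public static void main(String[] args) {
        // 3 shards, 10 hash slots, 2 chain slots: 40 fixed bytes + 400 index bytes.
        System.out.println(headerSize(3, 10, 2)); // 440
    }
}
```

If the layout holds, the body would start at the offset returned by `headerSize`.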

The header is followed by the body, which is laid out as shown below.

Body

number of records                   2 bytes         how many records does this user have?
a repeated sequence of records      variable size   as described below

All the records follow this specification:

attribute key                       X bytes     a string of the users key.
key delimiter                       1 byte      '\0'
client id                           2 bytes     some client id
last modified time (in ms)          8 bytes     This is the last modified time for this attribute in ms.
length of the avro binary data      2 bytes     actual length of avro binary data
the binary avro data or text        Y bytes     Length given by the previous field.

I already have a lot of files generated in the above format, and I need to read them from a Java program. What is the best way to do this in Java? This is the first time I am working with a memory-mapped file, so I am trying to understand how to proceed.

FileChannel fc = new RandomAccessFile(new File("c:/tmp/file.txt"), "rw").getChannel();

I am not sure what to do next. Any example would help me understand better.


Solution

  • This should do it. The key is the set of DataInputStream methods that read and convert bytes. DataInputStream always reads multi-byte values in big-endian order, so this assumes the file was written that way.

     ByteBuffer buf = ByteBuffer.allocate( 9999 ); // capacity
     int nRead = fc.read( buf );
     InputStream is = new ByteArrayInputStream( buf.array() );
     DataInputStream dis = new DataInputStream( is );
     int version = dis.readInt(); //                   4 bytes
     long timestamp = dis.readLong();  //                 8 bytes
     int numUsers = dis.readInt(); //                   4 bytes
    

    And so on.

    More details on the Body

    There's no need to store the key delimiter ('\0') or the length of the Avro data, since the byte array's length already expresses the latter. I'm using an int to hold the 2-byte fields, just to be on the safe side (Java has no unsigned short).

    public class UserAttribute {
      private final String attributeKey;
      private final int schemaId;               // unsigned short
      private final long lastModifiedDate;
      private final byte[] avroBinaryData;      // preceded by length: unsigned short
      // constructor and getters here
    
    }
    
    int numberOfAttributes = dis.readUnsignedShort();
    List<UserAttribute> ual = new ArrayList<>( numberOfAttributes );
    for( int iAttr = 0; iAttr < numberOfAttributes; ++iAttr ){
        // read the values for one attribute, then create the UserAttribute object
        StringBuilder sb = new StringBuilder();
        for(;;){
            int ub = dis.readUnsignedByte(); // can this be in ISO-8859-1 > 0x80?
            if( ub == 0 ) break;             // '\0' terminates the key
            sb.append( (char)ub );
        }
        String attributeKey = sb.toString();
        int schemaId = dis.readUnsignedShort();
        long lastModifiedDate = dis.readLong();
        int loabd = dis.readUnsignedShort(); // length of avro binary data
        byte[] abd = new byte[loabd];
        dis.readFully( abd );                // reads exactly loabd bytes or throws EOFException
        // the fields are final, so this assumes a constructor taking them in declaration order
        UserAttribute ua = new UserAttribute( attributeKey, schemaId, lastModifiedDate, abd );
        ual.add( ua );
    }
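
The same record layout can also be decoded straight from a ByteBuffer, which is the type that FileChannel.map returns. The sketch below is one possible way to do it, not a definitive implementation: `RecordReader` and `Rec` are hypothetical names, the key is assumed to be ISO-8859-1 (matching the byte-to-char cast above), and the buffer is assumed to be big-endian and positioned at the user's "number of records" field.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RecordReader {
    /** One decoded record; clientId corresponds to schemaId in the class above. */
    static final class Rec {
        final String key;
        final int clientId;      // unsigned short, widened to int
        final long lastModified; // milliseconds
        final byte[] data;       // Avro binary payload, not decoded here

        Rec(String key, int clientId, long lastModified, byte[] data) {
            this.key = key;
            this.clientId = clientId;
            this.lastModified = lastModified;
            this.data = data;
        }
    }

    /** Reads all records for one user, advancing the buffer's position. */
    static List<Rec> readRecords(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF; // mask to treat the 2 bytes as unsigned
        List<Rec> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // Collect key bytes up to, but not including, the '\0' delimiter.
            ByteArrayOutputStream keyBytes = new ByteArrayOutputStream();
            for (byte b = buf.get(); b != 0; b = buf.get()) {
                keyBytes.write(b);
            }
            // Assumption: keys are ISO-8859-1; change the charset if they are UTF-8.
            String key = new String(keyBytes.toByteArray(), StandardCharsets.ISO_8859_1);
            int clientId = buf.getShort() & 0xFFFF;
            long lastModified = buf.getLong();
            int length = buf.getShort() & 0xFFFF;
            byte[] data = new byte[length];
            buf.get(data); // bulk read of exactly `length` bytes
            out.add(new Rec(key, clientId, lastModified, data));
        }
        return out;
    }
}
```

A truncated or corrupt buffer makes the getters throw BufferUnderflowException, which plays the role that EOFException plays for DataInputStream.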
    

    Also, I think reading the shards should look like this:

    int numShards = dis.readInt(); // 4 bytes 1..101
    int[] shards = new int[numShards];
    for( int il = 0; il < numShards; ++il ){
        shards[il] = dis.readInt(); //  N * 4 bytes     Where N is the number of shards
    }
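
Putting the header reads together, the fixed part of the header (up to and including num_chain_slots) might be read from a ByteBuffer as sketched below. `FileHeader` and its field names are hypothetical, and the code assumes big-endian data packed in the order given in the question, with the buffer positioned at the start of the file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class FileHeader {
    final int version;
    final long lastModifiedDate;
    final int users;
    final int[] shards;
    final int numHashIndex;
    final int numChainSlots;

    private FileHeader(int version, long lastModifiedDate, int users,
                       int[] shards, int numHashIndex, int numChainSlots) {
        this.version = version;
        this.lastModifiedDate = lastModifiedDate;
        this.users = users;
        this.shards = shards;
        this.numHashIndex = numHashIndex;
        this.numChainSlots = numChainSlots;
    }

    /** Reads the header fields in file order; leaves the buffer at the start of the index. */
    static FileHeader read(ByteBuffer buf) {
        buf.order(ByteOrder.BIG_ENDIAN);      // assumption: the file is big-endian
        int version = buf.getInt();           // 4 bytes
        long lastModified = buf.getLong();    // 8 bytes
        int users = buf.getInt();             // 4 bytes
        int numShards = buf.getInt();         // 4 bytes
        // Guard against a corrupt count before allocating the array.
        if (numShards < 0 || numShards > buf.remaining() / 4) {
            throw new IllegalArgumentException("implausible shard count: " + numShards);
        }
        int[] shards = new int[numShards];
        for (int i = 0; i < numShards; i++) {
            shards[i] = buf.getInt();         // N * 4 bytes
        }
        int numHashIndex = buf.getInt();      // 4 bytes
        int numChainSlots = buf.getInt();     // 4 bytes
        return new FileHeader(version, lastModified, users, shards, numHashIndex, numChainSlots);
    }
}
```

After `read` returns, the buffer's position is where the user offset/size index would begin, so the index entries could be read next with the same relative getters.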
    

    Even later: memory mapping

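    int read = ...;
    // "r" is enough here, since the mapping itself is READ_ONLY
    FileChannel fc = new RandomAccessFile(file, "r").getChannel();
    ByteBuffer buffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, read );
    // big-endian is already the ByteBuffer default; stated explicitly for clarity
    buffer.order(ByteOrder.BIG_ENDIAN);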
    

    This results in a ByteBuffer of the given length containing the file data. If the file is larger than 0x7fffffff bytes (Integer.MAX_VALUE, just under 2 GiB), it must be mapped in chunks, which is possible with the same FileChannel method, map, by passing a different position and size for each chunk.
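
Since the header index gives an offset and size for each user, one way to handle large files is to map only the region that is needed. The following is a minimal sketch under that assumption; `RegionMapper` and `mapRegion` are illustrative names, not part of any library.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class RegionMapper {
    /**
     * Maps `size` bytes starting at `offset` as a read-only buffer.
     * The offset is a long, so it may lie beyond 2 GiB; only the size
     * of a single mapping is limited to Integer.MAX_VALUE.
     */
    static MappedByteBuffer mapRegion(Path file, long offset, long size) throws IOException {
        if (offset < 0 || size < 0 || size > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("offset/size out of range");
        }
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            // A read-only mapping cannot extend the file, so check the bounds first.
            if (offset > fc.size() - size) {
                throw new IllegalArgumentException("region extends past end of file");
            }
            MappedByteBuffer buf = fc.map(FileChannel.MapMode.READ_ONLY, offset, size);
            buf.order(ByteOrder.BIG_ENDIAN); // assumption: the file is big-endian
            // The mapping stays valid after the channel is closed.
            return buf;
        }
    }
}
```

The returned buffer starts at position 0, which corresponds to `offset` in the file, so a user's records can be parsed from it with the usual relative getters.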