I am working with a memory-mapped file in Java. The file stores per-user data in Avro binary-encoded format.

The memory-mapped file consists of two main parts:

- a header which serves as an index into the full file's contents, answering questions about the file as a whole and giving the offset into the file for each user's data.
- a body which holds the data for each user at the given offset.
Header

| Field | Size |
| --- | --- |
| version | 4 bytes |
| last_modified_date | 8 bytes |
| users | 4 bytes |
| shards | 4 bytes |
| the shards | N * 4 bytes |
| num_hash_index | 4 bytes |
| num_chain_slots | 4 bytes |
| user offset/size index | num_hash_index * num_chain_slots * (8 bytes + 8 bytes + 4 bytes) |
The header is then followed by the body, described below.
Body

| Field | Size | Notes |
| --- | --- | --- |
| number of records | 2 bytes | how many records does this user have? |
| records | variable | a repeated sequence of records as described below |
All the records follow this specification:

| Field | Size | Notes |
| --- | --- | --- |
| attribute key | X bytes | a string: the user's key |
| key delimiter | 1 byte | '\0' |
| client id | 2 bytes | some client id |
| last modified time (in ms) | 8 bytes | the last modified time for this attribute, in ms |
| length of the avro binary data | 2 bytes | actual length of the Avro binary data |
| the binary avro data or text | Y bytes | length given by the previous field |
Now I have a lot of files already generated in the above format, and I need to read them from a Java program. What is the best way to do this in Java? This is the first time I am working with memory-mapped files, so I am trying to understand how to proceed. So far I have:
```java
FileChannel fc = new RandomAccessFile( new File( "c:/tmp/file.txt" ), "rw" ).getChannel();
```
Now I am not sure what to do next. Any example would help me understand better.
This should do it. The key is the set of methods in DataInputStream that read and convert bytes. DataInputStream reads big-endian, which I suppose matches the file format.
```java
ByteBuffer buf = ByteBuffer.allocate( 9999 );               // capacity
int nRead = fc.read( buf );                                 // may read fewer bytes than the capacity
InputStream is = new ByteArrayInputStream( buf.array(), 0, nRead );
DataInputStream dis = new DataInputStream( is );

int version   = dis.readInt();  // 4 bytes
long timestamp = dis.readLong(); // 8 bytes
int numUsers  = dis.readInt();  // 4 bytes
```
And so on.
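For completeness, the remaining header fields (num_hash_index, num_chain_slots, and the offset/size index) can be read the same way. Below is a self-contained sketch that builds a fake header tail in memory and parses it; the variable names, the sample values, and the meaning of the third 4-byte index component are my guesses from the layout in the question:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class HeaderSketch {
    public static void main(String[] args) throws IOException {
        // Build a tiny fake header tail: num_hash_index, num_chain_slots,
        // then num_hash_index * num_chain_slots index entries of
        // (offset: 8 bytes, size: 8 bytes, plus a 4-byte field per the layout).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream( bos );
        out.writeInt( 2 );               // num_hash_index (sample value)
        out.writeInt( 1 );               // num_chain_slots (sample value)
        for( int i = 0; i < 2; ++i ){
            out.writeLong( 100L * i );   // user offset into the body
            out.writeLong( 50L );        // size of the user's data
            out.writeInt( i );           // third 4-byte index component (meaning unknown)
        }

        // Parse it back with DataInputStream, just like the header fields above.
        DataInputStream dis = new DataInputStream( new ByteArrayInputStream( bos.toByteArray() ) );
        int numHashIndex = dis.readInt();
        int numChainSlots = dis.readInt();
        int entries = numHashIndex * numChainSlots;
        long[] offsets = new long[entries];
        long[] sizes = new long[entries];
        int[] extras = new int[entries];
        for( int i = 0; i < entries; ++i ){
            offsets[i] = dis.readLong();
            sizes[i] = dis.readLong();
            extras[i] = dis.readInt();
        }
        System.out.println( numHashIndex + " " + numChainSlots + " " + offsets[1] + " " + sizes[0] );
    }
}
```

Reading into arrays like this lets you later seek straight to a user's data with the offset/size pair.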
More details on the Body
There's no need to store the key delimiter ('\0') or the length of the Avro data in the object, since the latter is implied by the byte array's length. I'm using an int to store the unsigned short values, just to be on the safe side (there is no unsigned short in Java):
```java
public class UserAttribute {
    private final String attributeKey;
    private final int schemaId;          // unsigned short in the file
    private final long lastModifiedDate;
    private final byte[] avroBinaryData; // preceded by length: unsigned short
    // constructor and getters here
}
```
```java
int numberOfAttributes = dis.readUnsignedShort(); // 2 bytes
List<UserAttribute> ual = new ArrayList<>( numberOfAttributes );
for( int iAttr = 0; iAttr < numberOfAttributes; ++iAttr ){
    // attribute key: bytes up to the '\0' delimiter
    StringBuilder sb = new StringBuilder();
    for(;;){
        int ub = dis.readUnsignedByte(); // can this be in ISO-8859-1 > 0x80?
        if( ub == 0 ) break;
        sb.append( (char)ub );
    }
    int schemaId = dis.readUnsignedShort();  // client id, 2 bytes
    long lastModifiedDate = dis.readLong();  // 8 bytes
    int loabd = dis.readUnsignedShort();     // length of the avro binary data
    byte[] abd = new byte[loabd];
    dis.readFully( abd );
    ual.add( new UserAttribute( sb.toString(), schemaId, lastModifiedDate, abd ) );
}
```
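To sanity-check the record layout, a round trip through the same stream API can help. This sketch writes one record following the layout given in the question and reads it back; the key, client id, timestamp, and payload are made-up sample values, and the "Avro" payload is just a stand-in byte array:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RecordRoundTrip {
    public static void main(String[] args) throws IOException {
        // Write one record: key bytes, '\0' delimiter, client id (2 bytes),
        // last modified time (8 bytes), payload length (2 bytes), payload.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream( bos );
        byte[] avro = { 1, 2, 3 };               // stand-in for the Avro payload
        out.write( "color".getBytes( StandardCharsets.ISO_8859_1 ) );
        out.writeByte( 0 );                      // key delimiter
        out.writeShort( 42 );                    // client id
        out.writeLong( 1234567890123L );         // last modified time (ms)
        out.writeShort( avro.length );           // payload length
        out.write( avro );

        // Read it back with the same approach as the loop above.
        DataInputStream dis = new DataInputStream( new ByteArrayInputStream( bos.toByteArray() ) );
        StringBuilder sb = new StringBuilder();
        for( int ub; (ub = dis.readUnsignedByte()) != 0; ){
            sb.append( (char)ub );
        }
        int clientId = dis.readUnsignedShort();
        long lastModified = dis.readLong();
        byte[] data = new byte[dis.readUnsignedShort()];
        dis.readFully( data );

        System.out.println( sb + " " + clientId + " " + lastModified + " " + Arrays.equals( data, avro ) );
    }
}
```

If the round trip prints the values you wrote, the parsing code matches the layout.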
Also, I think reading the shards should be

```java
int numShards = dis.readInt();   // 4 bytes, 1..101
int[] shards = new int[numShards];
for( int il = 0; il < numShards; ++il ){
    shards[il] = dis.readInt();  // N * 4 bytes, where N is the number of shards
}
```
Even later: memory mapping
```java
int read = ...;
FileChannel fc = new RandomAccessFile( file, "rw" ).getChannel();
ByteBuffer buffer = fc.map( FileChannel.MapMode.READ_ONLY, 0, read );
buffer.order( ByteOrder.BIG_ENDIAN );
```
This results in a ByteBuffer of the given length containing the file data. If the file is larger than 0x7fffffff bytes, it must be mapped in chunks, which is possible with the same FileChannel.map method by varying the position and size arguments.
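Once mapped, you can also read the header fields straight off the ByteBuffer instead of wrapping it in a DataInputStream; getInt/getLong on a ByteBuffer are big-endian by default, matching DataInput. Here is a sketch that uses a heap buffer as a stand-in for the mapped file, with made-up sample values; a MappedByteBuffer returned by FileChannel.map is read exactly the same way:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MappedHeaderSketch {
    public static void main(String[] args) {
        // Simulate the first 16 header bytes (version, last_modified_date, users)
        // in a heap buffer; in real code this would be the buffer from fc.map(...).
        ByteBuffer buffer = ByteBuffer.allocate( 16 );
        buffer.putInt( 3 );                  // version (sample value)
        buffer.putLong( 1700000000000L );    // last_modified_date (sample value)
        buffer.putInt( 250 );                // users (sample value)
        buffer.flip();                       // switch from writing to reading
        buffer.order( ByteOrder.BIG_ENDIAN ); // the default, set explicitly for clarity

        int version = buffer.getInt();       // 4 bytes
        long lastModified = buffer.getLong(); // 8 bytes
        int users = buffer.getInt();          // 4 bytes
        System.out.println( version + " " + lastModified + " " + users );
    }
}
```

Relative gets advance the buffer's position just like a stream; the absolute overloads (e.g. getLong(int index)) let you jump straight to a user's entry in the offset/size index without consuming earlier bytes.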