I've just started working on a Hadoop-based ingester for OpenStreetMap data. There are a few formats - but I've been targeting the Protocol Buffers based format (note - it's not pure protobuf).
It's looking to me like it would be more efficient to pre-split the file into a sequence file - as opposed to handling the variable-length encoding in a custom record reader / input format - but I would like a sanity check.
The format is described in more detail in the PBF Format Description, but basically it's a collection of [BlobHeader, Blob] blocks.
There's a Blob Header
message BlobHeader {
    required string type = 1;
    optional bytes indexdata = 2;
    required int32 datasize = 3;
}
And then the Blob (the size of which is defined by the datasize parameter in the header)
message Blob {
    optional bytes raw = 1; // No compression
    optional int32 raw_size = 2; // Only set when compressed, to the uncompressed size
    optional bytes zlib_data = 3;
    // optional bytes lzma_data = 4; // PROPOSED.
    // optional bytes OBSOLETE_bzip2_data = 5; // Deprecated.
}
There's more structure once you get down into the blob obviously - but I would handle that in the mapper - what I would like to do is initially have one blob per mapper (later might be some multiple of blobs per mapper).
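For what it's worth, the per-blob work in the mapper would probably start by inflating zlib_data back to raw_size bytes. Below is a minimal sketch of that step using only java.util.zip; the class and method names (BlobInflater, inflate) are made up for illustration, and it assumes the caller has already pulled zlib_data and raw_size out of the parsed Blob message.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical helper: inflates the zlib_data field of a Blob. */
final class BlobInflater {

    /**
     * @param zlibData contents of Blob.zlib_data
     * @param rawSize  value of Blob.raw_size (the uncompressed size)
     */
    static byte[] inflate(byte[] zlibData, int rawSize) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(zlibData);
            byte[] out = new byte[rawSize];
            int filled = 0;
            while (filled < rawSize && !inflater.finished()) {
                int n = inflater.inflate(out, filled, rawSize - filled);
                // No progress and nothing left to feed means the input is truncated.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
                filled += n;
            }
            if (filled != rawSize) {
                throw new DataFormatException(
                        "expected " + rawSize + " bytes, got " + filled);
            }
            return out;
        } finally {
            inflater.end(); // release native zlib resources
        }
    }
}
```

If the raw field is set instead, the blob is uncompressed and this step can be skipped.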
Some of the other input formats/record readers use a "big enough" split size, and then seek backwards/forwards to a delimiter - but since there is no delimiter that would let me know the offset of blobs/headers - and no index that points to them either - I can't see any way to get my split points without first streaming through the file.
Now, I wouldn't actually need to read the entire file off disk - I could start by reading the header, use its datasize to seek past the blob, set that position as the first split point, then repeat. But that's about the only alternative to pre-splitting into a sequence file I can come up with.
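To make the header-hopping idea concrete, here is a rough, Hadoop-free sketch. It assumes each block is laid out as a 4-byte big-endian BlobHeader length, then the BlobHeader, then the Blob. To stay self-contained it pulls datasize (field 3) out of the header with a tiny hand-rolled varint reader instead of the generated protobuf class; BlobOffsetScanner and its methods are hypothetical names, not part of any library.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds block start offsets without reading the blobs. */
final class BlobOffsetScanner {

    /** Reads a protobuf base-128 varint at buf[pos[0]] and advances pos[0]. */
    static long readVarint(byte[] buf, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    /** Extracts datasize (field 3, varint) from a serialized BlobHeader. */
    static int dataSize(byte[] header) throws IOException {
        int[] pos = {0};
        while (pos[0] < header.length) {
            long tag = readVarint(header, pos);
            int field = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            if (wireType == 0) {                       // varint field
                long v = readVarint(header, pos);
                if (field == 3) {
                    return (int) v;
                }
            } else if (wireType == 2) {                // length-delimited: skip it
                int n = (int) readVarint(header, pos);
                pos[0] += n;
            } else {
                throw new IOException("unexpected wire type " + wireType);
            }
        }
        throw new IOException("BlobHeader has no datasize field");
    }

    /** Returns the byte offset of every block's 4-byte length prefix. */
    static List<Long> blockOffsets(DataInputStream in, long fileLength) throws IOException {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        while (pos < fileLength) {
            int headerLen = in.readInt();
            byte[] header = new byte[headerLen];
            in.readFully(header);
            int blobLen = dataSize(header);
            long toSkip = blobLen;                     // hop over the blob itself
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    throw new EOFException("truncated blob in block at offset " + pos);
                }
                toSkip -= skipped;
            }
            offsets.add(pos);
            pos += 4L + headerLen + blobLen;
        }
        return offsets;
    }
}
```

Each returned offset is a candidate split point. On a seekable stream you would seek rather than skip, so the blob bytes are never read.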
Is there a better way to handle this - or if not, thoughts on the two suggestions?
Well, I went with parsing the binary file in the getSplits method - and since I'm skipping over 99% of the data it's plenty fast (~20 seconds for the 22 GB planet-osm world file). Here's the getSplits method in case anyone else stumbles along.
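import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
// BlobHeader is the class generated from the .proto file.

@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    Path file = OSMPBFInputFormat.getInputPaths(context)[0];
    // The FileSystem is a shared, cached instance, so it is deliberately not closed here.
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    long fileLength = fs.getFileStatus(file).getLen();
    FSDataInputStream in = fs.open(file);
    try {
        long pos = 0;
        // Loop on the file length; available() returns an int and is not a reliable EOF test.
        while (pos < fileLength) {
            int len = in.readInt();              // 4-byte length of the BlobHeader
            byte[] blobHeader = new byte[len];
            in.readFully(blobHeader);            // read() may return fewer bytes than requested
            BlobHeader h = BlobHeader.parseFrom(blobHeader);
            // The split starts at the length prefix; its length covers header + blob only,
            // so the record reader has to account for the 4-byte prefix itself.
            splits.add(new FileSplit(file, pos, len + h.getDatasize(), new String[] {}));
            pos += 4L + len + h.getDatasize();
            in.seek(pos);                        // jump over the blob without reading it
        }
    } finally {
        in.close();
    }
    // An IOException propagates and fails the job rather than returning a partial split list.
    return splits;
}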
It's working fine so far - though I haven't ground-truthed the output yet. It's definitely faster than copying the pbf to HDFS, converting to a sequence file in a single mapper, then ingesting (copy time dominates). It's also ~20% faster than having an external program copy to a sequence file in HDFS, then running a mapper against HDFS (I scripted the latter). So no complaints here.
Note that this generates a mapper for every block - which is ~23k mappers for the planet world file. I'm actually bundling up multiple blocks per split - just loop through x number of times before a split gets added to the collection.
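The bundling can be sketched independently of Hadoop as a grouping step over the per-block extents. This is only an illustration under assumptions: BlockBundler is a hypothetical name, each block is represented as {offset, totalLength} with totalLength including the 4-byte length prefix, and the blocks are contiguous and in file order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: merges consecutive blocks into larger splits. */
final class BlockBundler {

    /**
     * @param blocks         {offset, totalLength} per block, in file order
     * @param blocksPerSplit how many blocks each split should cover
     * @return {start, length} per split; the last split may cover fewer blocks
     */
    static List<long[]> bundle(List<long[]> blocks, int blocksPerSplit) {
        if (blocksPerSplit <= 0) {
            throw new IllegalArgumentException("blocksPerSplit must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < blocks.size(); i += blocksPerSplit) {
            int last = Math.min(i + blocksPerSplit, blocks.size()) - 1;
            long start = blocks.get(i)[0];
            long end = blocks.get(last)[0] + blocks.get(last)[1];
            splits.add(new long[] {start, end - start});
        }
        return splits;
    }
}
```

Each {start, length} pair would then become one FileSplit, and the record reader would keep reading blocks until it reaches the end of its split.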
For the BlobHeader I just compiled the protobuf .proto file from the OSM wiki link above. You can also pull it pre-generated from the osmosis-osm-binary artifact if you want - the Maven fragment is:
<dependency>
    <groupId>org.openstreetmap.osmosis</groupId>
    <artifactId>osmosis-osm-binary</artifactId>
    <version>0.43-RELEASE</version>
</dependency>