Tags: scala, binaryfiles, binary-data, akka-stream, decoder

Marc21 Binary Decoder with Akka-Stream


I'm trying to decode Marc21 binary data records. The specification describes the field that gives the length of the record as follows:

A Computer-generated, five-character number equal to the length of the entire record, including itself and the record terminator. The number is right justified and unused positions contain zeros.

I am trying to use Akka Stream's Framing.lengthField, but I don't know how to specify the size of that field. I imagine that a character is 8 bits, or maybe 16 for a number; I am not sure, and I wonder whether that depends on the platform or language. In short: is it possible to say what the size of that field is, given that I am working in Scala/Java?

Also, what does this mean:

"The number is right justified and unused positions contain zeros"

Does that have any implication for how one reads the value once it has been collected properly?

If anyone knows anything about this, please share.

EDIT1

Context:

I am trying to build a stream processing graph whose first stage processes the output of a system command run against a Symphony (vendor cataloging system) server. That output is a stream of unstructured byte chunks which, as a whole, represents all the Marc21 records requested (full dump or partial dump).

By processing I mean chunking that unstructured stream of bytes into a stream of frames, where each frame is a record.

In other words, reading the bytes for one record at a time and emitting each record individually to the next stage.

The next stage will emit that record (bytes) to Apache Kafka.

Obviously, the emission stage would be fully parallelized to speed up the process.

The Symphony server does not have the capability to stream a dump when requested, especially over the network. Hence this Akka Streams based graph, which performs that work for fast ingestion/production and streaming processing of our dumps within our overall fast data infrastructure.

EDIT2

Based on @badcook's input, I wonder if computeFrameSize could be used here. I am not sure; I am slightly confused by the function and by what it takes as parameters.

A little clarification would be much appreciated.


Solution

  • It looks like you're trying to parse MARC 21 records.

    In that case I would recommend you just take a look at MARC4J and use that.

    If you want to integrate it with Akka Streams, or even if you want to parse MARC records your own way, I would recommend breaking up your byte stream with Framing.delimiter, using the MARC 21 record terminator (ASCII control character 0x1D) as the delimiter, so that you get complete MARC records rather than trying to stream and work with fragments of MARC records. It'll be a lot easier.
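
    For a sense of what splitting on the record terminator does to a chunked byte stream, here is a minimal sketch in plain Scala, without Akka. `MarcSplitter` and `feed` are hypothetical names made up for this illustration; in an Akka Streams graph, Framing.delimiter would take over this job.

    ```scala
    import scala.annotation.tailrec

    // Hypothetical helper, for illustration only (not part of Akka or MARC4J).
    object MarcSplitter {
      // MARC 21 record terminator: ASCII control character 0x1D.
      val RecordTerminator: Byte = 0x1D.toByte

      // Appends a new chunk to the bytes carried over from earlier chunks.
      // Returns the complete records found (without their terminator) and
      // the leftover bytes of a record that has not been terminated yet.
      def feed(carry: Array[Byte], chunk: Array[Byte]): (Vector[Array[Byte]], Array[Byte]) = {
        @tailrec
        def loop(rest: Array[Byte], acc: Vector[Array[Byte]]): (Vector[Array[Byte]], Array[Byte]) = {
          val end = rest.indexOf(RecordTerminator)
          if (end < 0) (acc, rest)                             // no full record left
          else loop(rest.drop(end + 1), acc :+ rest.take(end)) // emit one record
        }
        loop(carry ++ chunk, Vector.empty)
      }
    }
    ```

    Each call passes in the leftover bytes returned by the previous call, so a record that straddles two chunks is only emitted once its terminator has arrived.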

    As for your specific questions: the MARC 21 specification describes its structure in terms of characters rather than raw bytes. It specifies two character encodings, UTF-8 and MARC 8, both of which are variable-width. Hence, no, it is not true that every character is a byte, and there is no single answer to how many bytes a character takes up. The record length field, however, holds only the digits 0 to 9, and an ASCII digit occupies exactly one byte in both encodings, so that field is 5 bytes long.

    "[R]ight justified and unused positions contain zeroes" is another way of saying that numbers are padded on the left with 0s. In this case the line comes from a larger quote stating that the numeric string must be 5 characters long. That means that if you are trying to represent the number 1, you must write it as 00001.