Search code examples
compressionbarcodehuffman-codecode128

Efficient compression and representation of key value pairs to be read from 1D barcodes


I'm currently writing an application for Windows Mobile which needs to be able to pick up key value pairs from 1D barcodes (configuration settings). The less barcodes need to be scanned, the better. Sample input:

------------------------------
| Key | Value                |    
------------------------------
| 12  | Söme UTF-8 Strîng    |
|  9  | & another string     |
------------------------------

I thought of the following algorithm:

1. Concat the key value pairs and encode the values with Base64

So we would get something like 12=U8O2bWUgVVRGLTggU3Ryw65uZw==&9=JiBhbm90aGVyIHN0cmluZw==

2. Use Huffman encoding to compress the data

I'd use a fixed Huffman tree for this, with the following information that helps me to compress the data:

-------------------------------------------
| Enties                       | Priority |    
-------------------------------------------
| =, &                         | High     |
| 0-9                          | Medium   |
| 5-bit Base64 Words (w/o 0-9) | Low      |
-------------------------------------------

3. Generate Code 128B barcodes from the encoded data

Apply Base96 encoding to the bit stream generated by the Huffman algorithm to get ASCII chars which can be used within a Code 128B barcode. Split the resulting string into multiple barcodes as required.

Coding this steps won't be a problem for me, but I would like to have some feedback about the efficiency and the design of the algorithm.

Questions

  • Am I losing some potential for better compression/shorter strings somewhere?
  • Is there a better way to compress the random UTF8 encoded data?
  • Should I embed a dynamic Huffman table into the encoded data?
  • How can I take the compression of Code 128B into account (a 0 requires less space than a &)?

Solution

  • After a lot of playing and fiddling around, we finally choose this approach:

    1. Encode settings into a byte stream

    Field values are serialized into a byte stream, with a header for each field. The header consumes one byte and contains the ID of the field and some flags that help to reduce the amount of data to transport. Depending on the type of the field (e.g. a string, a number or an IP address), the value is efficiently encoded into the byte stream. For example, an IP address is encoded with 4 bytes, whereas a boolean flag is directly encoded into the field header. This way, we're capable to encode even SSL certificates into the stream, if required. As the typical barcode formats are not able to transport arbitrary byte values, we need to encode the byte stream in the next step.

    2. Convert to barcode format

    The resulting byte array is now treated as a big, big integer and converted into the target barcode format using a base encoding and a charset (see this question). This way, we efficiently use the barcode format to transport our data (in contrast to Base64 or other encodings). From the resulting string, we can chunk of single barcodes and add some additional header information to them (e.g. how many barcodes have to be scanned? is the data encrypted? ...).

    When the barcodes get scanned on a mobile device, the encoded string can be restored and converted into the same, big integer. This integer can then be treated as a byte array, that can be parsed when the field serialization format is known.

    This approach turned out to be very efficient and fast (we had some concerns regarding the BigInteger implementation on CF).