Creating a gettext binary MO file with Java

I tried to create a utility that parses a gettext PO file and generates a binary MO file. The parser is simple (my company doesn't use fuzzy entries, plurals, etc., just msgid/msgstr), but the generator does not work.

Here is the description of the MO file, here is the original generator source (it's C), and I also found a PHP script (https://github.com/josscrowcroft/php.mo/blob/master/php-mo.php).

My code:

public void writeFile(String filename, Map<String, String> polines) throws FileNotFoundException, IOException {

  DataOutputStream os = new DataOutputStream(new FileOutputStream(filename));
  HashMap<String, String> bvc = new HashMap<String, String>();
  TreeMap<String, String> hash = new TreeMap(bvc);
  hash.putAll(polines);


  StringBuilder ids = new StringBuilder();
  StringBuilder strings = new StringBuilder();
  ArrayList<ArrayList> offsets = new ArrayList<ArrayList>();
  ArrayList<Integer> key_offsets = new ArrayList<Integer>();
  ArrayList<Integer> value_offsets = new ArrayList<Integer>();
  ArrayList<Integer> temp_offsets = new ArrayList<Integer>();

  for (Map.Entry<String, String> entry : hash.entrySet()) {
    String id = entry.getKey();
    String str = entry.getValue();

    ArrayList<Integer> offsetsItems = new ArrayList<Integer>();
    offsetsItems.add(ids.length());
    offsetsItems.add(id.length());
    offsetsItems.add(strings.length());
    offsetsItems.add(str.length());
    offsets.add((ArrayList) offsetsItems.clone());

    ids.append(id).append('\0');
    strings.append(str).append('\0');
  }
  Integer key_start = 7 * 4 + hash.size() * 4 * 4;
  Integer value_start = key_start + ids.length();

  Iterator e = offsets.iterator();
  while (e.hasNext()) {
    ArrayList<Integer> offEl = (ArrayList<Integer>) e.next();
    key_offsets.add(offEl.get(1));
    key_offsets.add(offEl.get(0) + key_start);
    value_offsets.add(offEl.get(3));
    value_offsets.add(offEl.get(2) + value_start);
  }

  temp_offsets.addAll(key_offsets);
  temp_offsets.addAll(value_offsets);


  os.writeByte(0xde);
  os.writeByte(0x12);
  os.writeByte(0x04);
  os.writeByte(0x95);

  os.writeByte(0x00);
  os.writeInt(hash.size() & 0xff);
  os.writeInt((7 * 4) & 0xff);
  os.writeInt((7 * 4 + hash.size() * 8) & 0xff);
  os.writeInt(0x00000000);
  os.writeInt(key_start & 0xff);

  Iterator offi = temp_offsets.iterator();
  while (offi.hasNext()) {
    Integer off = (Integer) offi.next();
    os.writeInt(off & 0xff);
  }
  os.writeUTF(ids.toString());
  os.writeUTF(strings.toString());

  os.close();
}

The line os.writeInt(key_start); seems OK; the differences from the MO file generated by the original tool start after these bytes.

What's wrong? (Aside from my scary English.)

Solution

When comparing your implementation with the documentation I noticed two things:

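~~The revision, directly after the magic number, should be an int.~~ This seems to work, but only by accident: writeByte writes exactly one byte, and the three leading zero bytes of the following big-endian writeInt complete the four-byte field. Using writeInt would be clearer, however.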
The & 0xFF part in the writeInt calls is wrong. That mask is meant for converting a signed byte to its unsigned integer value; applied to an int it keeps only the lowest 8 bits, so any count or offset greater than 255 is written incorrectly.
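
To illustrate the masking problem, here is a minimal sketch; the class and method names (MaskDemo, masked) are made up for this example and are not part of the question's code:

```java
public class MaskDemo {
    // Mimics the "& 0xff" applied before each writeInt call in the question.
    static int masked(int value) {
        return value & 0xff;
    }

    public static void main(String[] args) {
        System.out.println(masked(28));   // 28: values up to 255 pass through unchanged
        System.out.println(masked(300));  // 44: the upper bits of larger offsets are lost
    }
}
```

With more than a handful of messages the string offsets quickly exceed 255, which is when the generated file would start to break.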

For parsing the PO files you could also have a look at the zanata/tennera project on GitHub.

Edit: The writeUTF call is also problematic, since it prefixes the output with a two-byte length and mangles '\0' characters using Java's modified UTF-8 encoding. You could replace it with:

os.write(ids.toString().getBytes("utf-8"));
os.write(strings.toString().getBytes("utf-8"));
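
The difference between the two approaches can be checked with a small sketch like the one below. The class and method names (WriteUtfDemo, viaWriteUtf, viaGetBytes) are invented for the illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteUtfDemo {
    // Bytes produced by DataOutputStream.writeUTF: length prefix plus modified UTF-8.
    static byte[] viaWriteUtf(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        }
        return bos.toByteArray();
    }

    // Bytes produced by plain UTF-8 encoding, as in the replacement above.
    static byte[] viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String s = "a\0b";
        // 6 bytes: 2-byte length prefix, 'a', 0xC0 0x80 for '\0', 'b'
        System.out.println(viaWriteUtf(s).length);
        // 3 bytes: 'a', 0x00, 'b'
        System.out.println(viaGetBytes(s).length);
    }
}
```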

Another edit: I could not let go of this code. There were further problems: string lengths were measured in chars rather than UTF-8 bytes, and DataOutputStream writes big-endian instead of little-endian. I think the following code should work; the difference is that the file produced by msgfmt also contains an optional hash table to speed up access:

public static void writeInt(OutputStream os, int i) throws IOException {
    os.write((i) & 0xFF);
    os.write((i >>> 8) & 0xFF);
    os.write((i >>> 16) & 0xFF);
    os.write((i >>> 24) & 0xFF);
}

public static void writeFile(String filename, TreeMap<String, String> polines) throws IOException {
    OutputStream os = new BufferedOutputStream(new FileOutputStream(filename));
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    int size = polines.size();
    int[] indices = new int[size*2];
    int[] lengths = new int[size*2];
    int idx = 0;
    // write the strings and translations to a byte array and remember offsets and length in bytes
    for (String key : polines.keySet()) {
        byte[] utf = key.getBytes("utf-8");
        indices[idx] = bos.size();
        lengths[idx] = utf.length;
        bos.write(utf);
        bos.write(0);
        idx++;
    }
    for (String val : polines.values()) {
        byte[] utf = val.getBytes("utf-8");
        indices[idx] = bos.size();
        lengths[idx] = utf.length;
        bos.write(utf);
        bos.write(0);
        idx++;
    }

    try {
        int headerLength = 7*4;
        int tableLength = size*2*2*4;
        writeInt(os, 0x950412DE);                   // magic
        writeInt(os, 0);                            // file format revision
        writeInt(os, size);                         // number of strings
        writeInt(os, headerLength);                 // offset of table with original strings
        writeInt(os, headerLength + tableLength/2); // offset of table with translation strings
        writeInt(os, 0);                            // size of hashing table
        writeInt(os, headerLength + tableLength);   // offset of hashing table, not used since length is 0

        for (int i=0; i<size*2; i++) {
            writeInt(os, lengths[i]);
            writeInt(os, headerLength + tableLength + indices[i]);
        }

        // copy keys and translations
        bos.writeTo(os);

    } finally {
        os.close();
    }
}
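
One way to sanity-check a generated file is to read it back and compare the result with the input map. The following is only a rough sketch under the same assumptions as the code above (little-endian, UTF-8 strings, no plural or context handling); the class name MoCheck and its methods are invented for this example:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MoCheck {
    // Parses the header and both string tables of a little-endian MO file image.
    static Map<String, String> parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != 0x950412DE) {
            throw new IllegalArgumentException("not a little-endian MO file");
        }
        int count = buf.getInt(8);        // number of strings
        int origTable = buf.getInt(12);   // offset of table with original strings
        int transTable = buf.getInt(16);  // offset of table with translation strings
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String key = read(buf, data, origTable + i * 8);
            String val = read(buf, data, transTable + i * 8);
            result.put(key, val);
        }
        return result;
    }

    // Each table entry is a (length, offset) pair; the length excludes the trailing NUL.
    private static String read(ByteBuffer buf, byte[] data, int entry) {
        int length = buf.getInt(entry);
        int offset = buf.getInt(entry + 4);
        return new String(data, offset, length, StandardCharsets.UTF_8);
    }
}
```

For example, MoCheck.parse(Files.readAllBytes(Paths.get(filename))) should return the same entries that were passed to writeFile, in the same sorted order.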