
Microsoft CHM contents -- how to view them?


I have a .chm file (from 7-Zip, but I don't think it matters). I extracted the contents of the .chm and got the expected .hhc, .hhk, .htm, and .css files. However, I also got 10 more files with no extension, 8 of which begin with a hash (e.g. '#OBJINST') and two of which start with a dollar sign. When I try to open these files in Atom or VSCode, I get a bunch of random characters (empty squares, triangles with question marks, and so on) with a few actual words scattered here and there, like "HHA Version 4.74.8702" or "7zip.hhk".

I'm trying to parse these files to learn more about how .chm files work, and I'd really like to figure out how these extensionless files work and how they fit into the picture. I've done Google searches, but nothing popped up that seemed relevant. It looks like an encoding issue, but none of Atom's encoding options fixed the problem.

Any idea what's going on here? More specifically, how can I view the contents of these files (if I even can)?


Solution

  • As you probably know, Windows HTML Help is delivered as an LZX-compressed binary file with the .chm extension. It contains a set of HTML files, a hyperlinked table of contents, and an index file. The file format has been reverse-engineered, and documentation of it is freely available, e.g. the Unofficial (Preliminary) HTML Help Specification. This is the best reference I know of.

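    In relation to your question, you should look at the Internal file formats section in particular. The extensionless files whose names start with # or $ are binary internal files, not text, which is why no encoding setting in a text editor makes them readable. Please also note the image in the $FIftiMain section.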

    But I would like to warn you: digging into this internal file format can cost a lot of time.

    The file starts with the bytes "ITSF" (in ASCII), for "Info-Tech Storage Format" (see Microsoft's HTML Help (.chm) format documentation). The CHM can be opened with FAR HTML, as shown in the screenshot in my answer to this SO thread about getting CHM details from a help ID.

    For more information on decompiling, have a look at Decompile CHM too.
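
If you just want to look inside these binary files, a hex dump is more useful than a text editor. Below is a minimal Python sketch, not a full CHM parser. The function names `read_itsf_header` and `hexdump` are made up for this example. The header layout it assumes (a 4-byte "ITSF" signature followed by two little-endian 32-bit fields, version and total header length) is taken from the unofficial specification mentioned above, so treat it as an assumption and check it against your own file.

```python
import struct

ITSF_MAGIC = b"ITSF"  # ASCII signature at the very start of a .chm file


def read_itsf_header(data: bytes) -> dict:
    """Check the ITSF signature and read the two fields that follow it.

    Assumed layout (per the unofficial spec): bytes 4-7 hold the version
    and bytes 8-11 hold the total header length, both little-endian.
    """
    if len(data) < 12 or data[:4] != ITSF_MAGIC:
        raise ValueError("not an ITSF (CHM) file")
    version, header_len = struct.unpack_from("<II", data, 4)
    return {"version": version, "header_len": header_len}


def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values, and printable ASCII columns."""
    pad = width * 3 - 1  # column width of a full row of hex pairs
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as '.', like most hex viewers do
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)
```

To inspect one of the extracted files, read it in binary mode and dump the first bytes, for example `print(hexdump(open("#OBJINST", "rb").read()[:256]))`. Readable strings such as "HHA Version 4.74.8702" then show up in the right-hand column, next to the surrounding binary data.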