windows winapi unicode ascii

Maximum number of characters output from Win32 ToUnicode()/ToAscii()


What is the maximum number of characters that could be output from the Win32 functions ToUnicode()/ToAscii()?

Surely there is a sensible upper bound on what it can output given a virtual key code, scan key code, and keyboard state?


Solution

  • On my Windows 8 machine, USER32!ToAscii calls USER32!ToUnicode with an internal buffer and cchBuff set to 2. Because the output of ToAscii is an LPWORD and not an LPSTR, we cannot infer anything about the real limits of ToUnicode from this investigation, but we do know that ToAscii only ever outputs a single WORD. The return value tells you whether 0, 1 or 2 bytes of this WORD contain useful data.
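To make the "0, 1 or 2 bytes of a WORD" point concrete, here is a minimal sketch of unpacking that WORD. The helper name `unpack_toascii_word` is hypothetical (it is not a Win32 function), and the sketch assumes the characters are stored in the WORD in memory order and that a negative (dead key) return leaves nothing useful to read; neither assumption is confirmed by the documentation quoted here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of Win32.
 * `word` is the value ToAscii stored through its LPWORD parameter and
 * `rc` is ToAscii's return value. Copies the useful bytes into `out`
 * and returns how many were copied (0, 1 or 2).
 * Assumption: the characters sit in the WORD in memory order, so the
 * first character is the byte at the lowest address. */
static int unpack_toascii_word(uint16_t word, int rc, char out[2])
{
    int n;

    assert(out != NULL);

    if (rc <= 0)
        n = 0;          /* no translation, or dead key: nothing to read */
    else if (rc > 2)
        n = 2;          /* a WORD cannot hold more than two bytes */
    else
        n = rc;

    memcpy(out, &word, (size_t)n);
    return n;
}
```

Copying with `memcpy` rather than shifting and masking avoids hard-coding a byte order into the helper.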

    Moving on to ToUnicode, things get a bit trickier. If it returns 0, nothing was written. If it returns 1 or -1, one WCHAR (a UTF-16 code unit) was written. That leaves the strange "2 ≤ value" return case. We can try to dissect the MSDN documentation:

    Two or more characters were written to the buffer specified by pwszBuff. The most common cause for this is that a dead-key character (accent or diacritic) stored in the keyboard layout could not be combined with the specified virtual key to form a single character. However, the buffer may contain more characters than the return value specifies. When this happens, any extra characters are invalid and should be ignored.

    You could interpret this as "two or more characters were written but only two of them are valid" but then the return value should be documented as 2 and not 2 ≤ value.

    I believe there are two things going on in that sentence and we should eliminate what it calls "extra characters":

    However, the buffer may contain more characters than the return value specifies.

    This just implies that the function may party on your buffer beyond what it is actually going to return as valid. This is confirmed by:

    When this happens, any extra characters are invalid and should be ignored.

    This just leaves us with the unfortunate opening sentence:

    Two or more characters were written to the buffer specified by pwszBuff.

    I have no problem imagining a return value of 2; it can be as simple as a base character combined with a diacritic for which no pre-composed code point exists.

    The "or more" part could come from multiple sources. If the base character is encoded as a surrogate pair, then any additional diacritic or combining character will push you over 2. There could also simply be more than one diacritic or combining character on the base character. There might even be a leading LTR/RTL mark.
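The arithmetic behind those three cases can be sketched by counting UTF-16 code units for a sequence of code points. The helper names `utf16_units` and `utf16_length` are made up for illustration, and the sequences mentioned afterwards are hypothetical inputs, not output observed from any real keyboard layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-16 code units (WCHARs) needed for one code point:
 * anything above U+FFFF needs a surrogate pair. */
static size_t utf16_units(uint32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}

/* Total number of WCHARs needed for a sequence of `n` code points. */
static size_t utf16_length(const uint32_t *cps, size_t n)
{
    size_t total = 0;

    assert(cps != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        total += utf16_units(cps[i]);
    return total;
}
```

With this, "q" plus a combining acute (U+0071, U+0301) needs 2 WCHARs, a supplementary-plane base character plus one combining mark (U+1D400, U+0301) needs 3, and a right-to-left mark, that same base character and two combining marks (U+200F, U+1D400, U+0301, U+0308) needs 5.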

    I don't know whether it is possible to end up with all three conditions at the same time, but I would play it safe and pass a buffer of 10 or so WCHARs. This should be well beyond anything you can produce on a keyboard with "a single keystroke".
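As a rough sketch of that advice, the following calls ToUnicode with a 10-WCHAR buffer and only trusts as many characters as the return value reports. `KEY_BUF_LEN`, `valid_units` and `translate_key` are hypothetical names introduced here. Treating a negative return as "one character" follows the reading above rather than anything the documentation guarantees. The Win32 part only compiles on Windows and links against user32.

```c
#include <assert.h>

#define KEY_BUF_LEN 10  /* the "10 or so WCHARs" suggested above */

/* How many leading WCHARs of the buffer to trust, given ToUnicode's
 * return value `rc` and the buffer capacity `cap`. Anything beyond
 * this count may have been scribbled on and should be ignored. */
static int valid_units(int rc, int cap)
{
    assert(cap >= 0);

    if (rc == 0)
        return 0;                   /* no translation */
    if (rc < 0)
        return cap > 0 ? 1 : 0;     /* dead key: assume one character */
    return rc < cap ? rc : cap;     /* never trust more than we passed */
}

#ifdef _WIN32
#include <windows.h>

/* Translate one keystroke using the current keyboard state.
 * Returns the number of valid WCHARs stored in `out`. */
static int translate_key(UINT vk, UINT scan, WCHAR out[KEY_BUF_LEN])
{
    BYTE state[256];
    int rc;

    if (!GetKeyboardState(state))
        return 0;

    rc = ToUnicode(vk, scan, state, out, KEY_BUF_LEN, 0);
    return valid_units(rc, KEY_BUF_LEN);
}
#endif
```

Clamping to the buffer size is purely defensive; the point is that the caller reads only the count returned by `valid_units` and never the whole buffer.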

    This is by no means a final answer but it might be the best you are going to get unless somebody from Microsoft responds.