
fscanf read()s more than the number of characters I asked for


I have the following code:

#include <stdio.h>

int main(void)
{
  unsigned char c;

  setbuf(stdin, NULL);
  scanf("%2hhx", &c);
  printf("%d\n", (int)c);
  return 0;
}

I set stdin to be unbuffered, then ask scanf to read up to 2 hex characters. Indeed, scanf does as asked; for example, having compiled the code above as foo:

$ echo 23 | ./foo
35

However, if I strace the program, I find that libc actually read 3 characters. Here is a partial log from strace:

$ echo 234| strace ./foo
read(0, "2", 1)                         = 1
read(0, "3", 1)                         = 1
read(0, "4", 1)                         = 1
35 # prints the correct result

So scanf is giving the expected result. However, this extra character being read is detectable, and it happens to break the communications protocol I am trying to implement (in my case, GDB remote debugging).

The man page for scanf says about the field width:

Reading of characters stops either when this maximum is reached or when a nonmatching character is found, whichever happens first.

This seems a bit deceptive, at least; or is it in fact a bug? Is it too much to hope that with unbuffered stdin, scanf might read no more than the amount of input I asked for?

(I'm running on Ubuntu 18.04 with glibc 2.27; I've not tried this on other systems.)


Solution

  • This seems a bit deceptive, at least; or is it in fact a bug?

    IMO, no.

    An input item is read from the stream, ... An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure. C17dr § 7.21.6.2 ¶9

    Code such as "%hhx" (without a width limit) certainly must read 1 character past the hex digits to know it is done. That excess character is pushed back into stdin for the next input operation.

    The "The first character, if any, after the input item remains unread" wording implies a disassociation between reading characters at the lowest level and reading characters from the stream: a stream can push back at least 1 character and consider it "unread". The width limit of 2 does not save your code, as 3 characters can still be read from the underlying descriptor and 1 pushed back.

    The width of 2 limits the maximum number of bytes to interpret, not the number of characters read at the lowest level.
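    The stream-level "remains unread" semantics can be seen without strace. The sketch below uses POSIX fmemopen (an assumption: it is not part of ISO C, but is available on glibc) to build a stream over "234"; after "%2hhx" consumes two digits, the '4' is still available to the next stream read, whether or not the implementation touched it internally:

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        /* In-memory stream standing in for stdin carrying "234". */
        char buf[] = "234";
        FILE *f = fmemopen(buf, sizeof buf - 1, "r");
        unsigned char c;

        assert(fscanf(f, "%2hhx", &c) == 1);
        assert(c == 0x23);       /* 35 decimal, as in the question */
        assert(fgetc(f) == '4'); /* the character after the input item is still "unread" */

        fclose(f);
        return 0;
    }
    ```

    Note this only demonstrates the stream's view; it says nothing about how many bytes the implementation pulled from the underlying source, which is exactly the gap the question stumbled on.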

  • Is it too much to hope that with unbuffered stdin, scanf might read no more than the amount of input I asked for?

    Yes. Buffered or not, a stream like stdin is allowed to push back characters and consider them unread.

    In any case, expecting "%2hhx" to read no more than 2 characters is brittle, since leading white-space does not count toward the width: "These white-space characters are not counted against a specified field width."


    Setting stdin to be unbuffered does not stop a stream from reading an excess character and later pushing it back.


    Given that "this extra character being read is detectable, and it happens to break the communications protocol", I recommend a new approach that does not use a stream.
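    One way to do that is to bypass stdio entirely and call POSIX read() on the file descriptor, pulling exactly the bytes the protocol permits and parsing them with sscanf afterwards. This is a sketch, not the only design; read_exact is a hypothetical helper name, and the pipe here merely simulates the protocol peer so the example is self-contained:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Read exactly n bytes from fd, handling short reads; returns 0 on success. */
    static int read_exact(int fd, char *buf, size_t n)
    {
        size_t got = 0;
        while (got < n) {
            ssize_t r = read(fd, buf + got, n - got);
            if (r <= 0)
                return -1; /* EOF or error */
            got += (size_t)r;
        }
        return 0;
    }

    int main(void)
    {
        int fds[2];
        char buf[3] = {0}, next;
        unsigned int value;

        /* Simulate the peer sending "234"; in real code fd 0 would be used. */
        assert(pipe(fds) == 0);
        assert(write(fds[1], "234", 3) == 3);

        assert(read_exact(fds[0], buf, 2) == 0);  /* consume exactly 2 bytes */
        assert(sscanf(buf, "%2x", &value) == 1);  /* parse outside the descriptor */
        assert(value == 0x23);

        assert(read(fds[0], &next, 1) == 1);
        assert(next == '4');                      /* byte 3 was never consumed early */

        printf("%u\n", value); /* prints 35 */
        return 0;
    }
    ```

    Because the parsing happens on a buffer you filled yourself, no library code ever touches the descriptor behind your back, which is what a byte-exact protocol like GDB remote serial needs.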