How many tokens are there in the following declaration?
int a[2][3];
I think the tokens are {'int', '[', ']', '[', ']', ';'}.
Can someone explain what the compiler does and does not count when it splits the input into tokens?
Thanks
Expanding on my comment:
How the input is tokenized is a function of your tokenizer (scanner). In principle, the input you presented might be tokenized as "int", "a", "[2]", "[3]", ";", for example. In practice, the most likely choice of tokenization would be "int", "a", "[", "2", "]", "[", "3", "]", ";". I am uncertain why you seem to think that the variable name and dimension values would not be represented among the tokens -- they carry semantic information and therefore must not be left out.
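To make the two alternatives concrete, here is a minimal sketch in Python that splits the declaration both ways. The two regular expressions and the helper name `lexemes` are my own illustrative assumptions, not part of any real compiler; note that `findall` silently skips characters that match neither pattern, which a real scanner would report as an error.

```python
import re

# Fine-grained (assumed pattern): names, integer literals, single punctuators.
FINE = re.compile(r"[A-Za-z_]\w*|\d+|[\[\];]")
# Coarse-grained (hypothetical pattern): a whole "[N]" dimension is one token.
COARSE = re.compile(r"[A-Za-z_]\w*|\[\d+\]|;")

def lexemes(pattern, source):
    """Return the token texts that `pattern` finds in `source`.

    Whitespace (and anything else the pattern does not match) is skipped.
    """
    return pattern.findall(source)

source = "int a[2][3];"
fine = lexemes(FINE, source)
coarse = lexemes(COARSE, source)

print(len(fine), fine)      # 9 ['int', 'a', '[', '2', ']', '[', '3', ']', ';']
print(len(coarse), coarse)  # 5 ['int', 'a', '[2]', '[3]', ';']
```

Either way, the name "a" and the values 2 and 3 show up among the tokens; the two schemes differ only in how finely the brackets are split.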
Although separating compiling into a lexical analysis step and a semantic analysis step is common and widely considered useful, it is not inherently essential to make such a separation at all. Where it is made, the choice of tokenization is up to the compiler. One ordinarily chooses tokens so that each represents a semantically significant unit, but there is more than one way to do that. For instance, my alternative example corresponds to a token sequence that might be characterized as
IDENTIFIER, IDENTIFIER, DIMENSION, DIMENSION, TERMINATOR
The more likely approach might be characterized as
IDENTIFIER, IDENTIFIER, OPEN_BRACKET, INTEGER, CLOSE_BRACKET, OPEN_BRACKET,
INTEGER, CLOSE_BRACKET, TERMINATOR
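As a rough illustration of how the two characterizations relate, the sketch below scans the declaration into the fine-grained kinds and then folds each OPEN_BRACKET, INTEGER, CLOSE_BRACKET run into a single DIMENSION token. The kind names come from the sequences above; the regular expressions and the function names `scan` and `fold_dimensions` are assumptions made for this example only.

```python
import re

# Kind names follow the answer; the patterns themselves are illustrative.
TOKEN_SPEC = [
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("INTEGER", r"\d+"),
    ("OPEN_BRACKET", r"\["),
    ("CLOSE_BRACKET", r"\]"),
    ("TERMINATOR", r";"),
    ("SKIP", r"\s+"),
]
SCANNER = re.compile("|".join(f"(?P<{kind}>{rx})" for kind, rx in TOKEN_SPEC))

def scan(source):
    """Fine-grained scan: return (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = SCANNER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def fold_dimensions(tokens):
    """Collapse OPEN_BRACKET INTEGER CLOSE_BRACKET runs into one DIMENSION."""
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if [kind for kind, _ in window] == ["OPEN_BRACKET", "INTEGER", "CLOSE_BRACKET"]:
            out.append(("DIMENSION", "".join(text for _, text in window)))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out

fine = scan("int a[2][3];")
coarse = fold_dimensions(fine)
print([kind for kind, _ in fine])
print([kind for kind, _ in coarse])
```

Whether that folding happens inside the scanner or later in the parser is exactly the design choice under discussion: the same information is present in both sequences.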
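The questions to consider include which units carry meaning on their own (you would not split "int" into two tokens, because such tokens do not represent a complete semantic unit) and how much structure the scanner itself should recognize (for example, emitting DIMENSION instead of OPEN_BRACKET, INTEGER, CLOSE_BRACKET).
Updated to add: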
The C standard does define the post-preprocessing language in terms of a specific tokenization, which for the statement you gave would be the "most likely" alternative I specified (and that's one reason why it's the most likely). I have answered the question in a more general sense, however, in part because it is tagged [compiler-construction].
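For reference, the C standard sorts post-preprocessing tokens into five categories: keyword, identifier, constant, string-literal, and punctuator. The sketch below labels the declaration with those categories. It is deliberately incomplete: the keyword set is a small subset, only decimal integer constants and a handful of punctuators are recognized, and the names `KEYWORDS`, `LEXEME`, and `c_tokens` are hypothetical.

```python
import re

# A small subset of C keywords; the list in the standard is longer.
KEYWORDS = {"int", "char", "float", "double", "void", "return"}
# Optional leading whitespace, then a name, a decimal integer, or one punctuator.
LEXEME = re.compile(r"\s*([A-Za-z_]\w*|\d+|[\[\](){};,=*])")

def c_tokens(source):
    """Return (category, text) pairs for a tiny subset of C."""
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        m = LEXEME.match(source, pos)
        if m is None:
            raise ValueError(f"cannot tokenize input at offset {pos}")
        text = m.group(1)
        if text in KEYWORDS:
            category = "keyword"
        elif text[0].isdigit():
            category = "constant"
        elif text[0].isalpha() or text[0] == "_":
            category = "identifier"
        else:
            category = "punctuator"
        tokens.append((category, text))
        pos = m.end()
    return tokens

tokens = c_tokens("int a[2][3];")
for category, text in tokens:
    print(f"{category:11} {text}")
print(len(tokens))  # 9
```

Under this scheme the declaration comes to nine tokens: one keyword, one identifier, two constants, and five punctuators.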