c compiler-construction token

What will be the number of tokens (compiler)?


How many tokens are there in the following declaration?

int a[2][3];

I think the tokens are {'int', '[', ']', '[', ']', ';'}.

Can someone explain what the compiler does and does not count as a token?

Thanks


Solution

  • Expanding on my comment: how the input is tokenized depends on your tokenizer (scanner). In principle, the input you presented might be tokenized as "int", "a", "[2]", "[3]", ";" (five tokens), for example. In practice, the most likely tokenization is "int", "a", "[", "2", "]", "[", "3", "]", ";" (nine tokens). I am not sure why you think the variable name and the dimension values would not be among the tokens: they carry semantic information and therefore must not be left out.

    Although separating compilation into a lexical analysis step and later syntactic and semantic analysis steps is common and widely considered useful, it is not essential to make such a separation at all. Where it is made, the choice of tokenization is up to the compiler's designer. One ordinarily chooses tokens so that each represents a semantically significant unit, but there is more than one way to do that. For instance, my alternative (five-token) example corresponds to a token sequence that might be characterized as

    IDENTIFIER, IDENTIFIER, DIMENSION, DIMENSION, TERMINATOR
    

    The more likely approach might be characterized as

    IDENTIFIER, IDENTIFIER, OPEN_BRACKET, INTEGER, CLOSE_BRACKET, OPEN_BRACKET,
            INTEGER, CLOSE_BRACKET, TERMINATOR
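
The finer-grained tokenization above can be sketched in C. This is a minimal illustration rather than how any particular compiler does it: the names `TokenKind`, `Token`, and `tokenize` are hypothetical, only the characters needed for this declaration are handled, and, following the listing above, `int` is reported as an identifier rather than as a keyword.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch; a real C lexer has many more. */
typedef enum {
    TOK_IDENTIFIER,
    TOK_INTEGER,
    TOK_OPEN_BRACKET,
    TOK_CLOSE_BRACKET,
    TOK_TERMINATOR,
    TOK_UNKNOWN
} TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* points into the source string */
    size_t length;      /* number of characters in the token */
} Token;

/* Splits src into tokens, storing at most max of them in out.
   Returns the total number of tokens found, even if it exceeds max. */
size_t tokenize(const char *src, Token *out, size_t max)
{
    size_t count = 0;
    const char *p = src;
    assert(src != NULL && out != NULL);

    while (*p != '\0') {
        const char *start;
        TokenKind kind;

        /* Whitespace separates tokens but is not a token itself. */
        if (isspace((unsigned char)*p)) { p++; continue; }

        start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {
            /* Names such as "int" and "a": consume the whole word. */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            kind = TOK_IDENTIFIER;
        } else if (isdigit((unsigned char)*p)) {
            /* Decimal integers such as "2" and "3". */
            while (isdigit((unsigned char)*p)) p++;
            kind = TOK_INTEGER;
        } else {
            /* Single-character punctuation. */
            switch (*p) {
            case '[': kind = TOK_OPEN_BRACKET;  break;
            case ']': kind = TOK_CLOSE_BRACKET; break;
            case ';': kind = TOK_TERMINATOR;    break;
            default:  kind = TOK_UNKNOWN;       break;
            }
            p++;
        }

        if (count < max) {
            out[count].kind = kind;
            out[count].start = start;
            out[count].length = (size_t)(p - start);
        }
        count++;
    }
    return count;
}
```

Called as `tokenize("int a[2][3];", toks, 16)` with a 16-element `Token` array, this sketch reports nine tokens, in the order listed above.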
    

    The questions to consider include:

    • What units of the source contain meaningful semantic information in their own right? For instance, it is not useful to make each character a separate token or to split int into two tokens, because such fragments do not represent complete semantic units.
    • How much responsibility can or should you put on the lexical analyzer? For instance, should it understand the context well enough to emit DIMENSION instead of OPEN_BRACKET, INTEGER, CLOSE_BRACKET?
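
To make the second question concrete, here is a hedged sketch of a scanner that takes on more responsibility and emits a single DIMENSION token for each bracketed integer. The names `CoarseKind`, `match_dimension`, and `tokenize_coarse` are made up for this illustration, and the sketch assumes the brackets contain only decimal digits with no whitespace.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical coarse token kinds, mirroring the five-token alternative. */
typedef enum { CT_IDENTIFIER, CT_DIMENSION, CT_TERMINATOR, CT_UNKNOWN } CoarseKind;

/* If p points at "[digits]", returns a pointer just past the ']';
   otherwise returns NULL. */
static const char *match_dimension(const char *p)
{
    if (*p != '[') return NULL;
    p++;
    if (!isdigit((unsigned char)*p)) return NULL;
    while (isdigit((unsigned char)*p)) p++;
    return (*p == ']') ? p + 1 : NULL;
}

/* Writes at most max token kinds to kinds and returns the total
   number of tokens found, even if it exceeds max. */
size_t tokenize_coarse(const char *src, CoarseKind *kinds, size_t max)
{
    size_t count = 0;
    const char *p = src;
    assert(src != NULL && kinds != NULL);

    while (*p != '\0') {
        CoarseKind kind = CT_UNKNOWN;
        const char *end;

        /* Whitespace separates tokens but is not a token itself. */
        if (isspace((unsigned char)*p)) { p++; continue; }

        if (isalpha((unsigned char)*p) || *p == '_') {
            /* Names such as "int" and "a". */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            kind = CT_IDENTIFIER;
        } else if ((end = match_dimension(p)) != NULL) {
            /* The scanner, not the parser, recognizes the whole "[2]". */
            p = end;
            kind = CT_DIMENSION;
        } else {
            /* Any other single character; only ';' is recognized here. */
            if (*p == ';') kind = CT_TERMINATOR;
            p++;
        }

        if (count < max) kinds[count] = kind;
        count++;
    }
    return count;
}
```

For `int a[2][3];` this yields five tokens. The cost shows up quickly: C allows whitespace and arbitrary constant expressions between the brackets (for example `a[ 2 ]` or `a[N + 1]`), and handling those in the scanner means duplicating work the parser already does, which is one reason the finer-grained tokenization is the usual choice.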

    Updated to add:

    The C standard does define the post-preprocessing language in terms of a specific tokenization, which for the declaration you gave would be the "most likely" alternative I specified (and that's one reason why it's the most likely). In the standard's terms, those nine tokens are the keyword int, the identifier a, two integer constants, and five punctuators. I have answered the question in a more general sense, however, in part because it is tagged [compiler-construction].