
Preprocessor only on arbitrary file?


I wanted to demonstrate that the preprocessor is completely independent of the build process. It has its own grammar and its own lexer, separate from those of the C language. In fact, I wanted to show that the preprocessor can be applied to any type of file.

So I have this arbitrary file:

#define FOO    
#ifdef FOO    
I am foo    
#endif    
#
#
#    
Something    
#pragma Hello World

And I thought this would work:

$ gcc -E test.txt -o-
gcc: warning: test.txt: linker input file unused because linking not done

Unfortunately, it only works like this:

$ cat test.txt | gcc -E -

Why does GCC give this warning?


Solution

  • The C compiler uses the file name suffix to decide what to do with each input file: files ending in .c are compiled as C, while files ending in .o or .so only need to be linked. For files ending in .s it calls the assembler as(1), for files ending in .f it calls the Fortran compiler, and for .cc it switches to C++ compilation.

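    Normally, C compilers treat any file whose suffix they do not recognise as linker input and pass it to the linker, ld(1). This is what happens with your .txt file: because -E stops after preprocessing, the linker never runs, hence the warning that the linker input file is unused. (The linker has a similar way of telling ld(1) scripts apart from object or shared object files.) When the input comes from standard input there is no suffix to inspect, and with -E GCC treats it as C. You can get the same effect for a named file by stating the language explicitly with -x, for example gcc -E -x c test.txt.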
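The suffix rule described above can be sketched in a few lines of C. This is only an illustration of the idea, not how GCC is actually implemented: the enum, the function names and the short suffix list are all made up for this example, and a real driver recognises many more suffixes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical categories, invented for this sketch. */
enum input_kind { KIND_C, KIND_CXX, KIND_ASM, KIND_FORTRAN, KIND_LINKER };

/* Classify a file name by its suffix; anything unrecognised
   falls through to "linker input", as described in the answer. */
enum input_kind classify(const char *name)
{
    const char *dot = strrchr(name, '.');
    if (dot == NULL)
        return KIND_LINKER;
    if (strcmp(dot, ".c") == 0)  return KIND_C;
    if (strcmp(dot, ".cc") == 0) return KIND_CXX;
    if (strcmp(dot, ".s") == 0)  return KIND_ASM;
    if (strcmp(dot, ".f") == 0)  return KIND_FORTRAN;
    return KIND_LINKER;          /* .o, .so, .txt, ... */
}

/* Small self-check; call it from main() to exercise the sketch. */
void check_classify(void)
{
    assert(classify("main.c") == KIND_C);
    assert(classify("test.txt") == KIND_LINKER);
}
```

With a rule like this, test.txt ends up in the same bucket as an object file, which is why the driver talks about a "linker input file".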

    BTW, the cpp language is indeed a macro language, but some similarities with C cannot be avoided. At the very least it has to recognise C identifiers, because macro names have the same syntax as C identifiers and it must check whether each identifier matches a macro name. It also has to recognise C comments and C strings, because macros are not expanded inside them (and it removes comments before the compiler sees them). It has to recognise parentheses and the , symbol too, since they delimit and separate macro arguments. Finally, inside a macro body it recognises the operators # (to stringify a parameter) and ## (to concatenate two tokens into one). This last operator forces cpp to recognise almost every C token, because it must report an error when the result is not a valid token: pasting + and + gives ++, which is fine, but pasting + and - gives +-, which is an error.
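As a small illustration of the two operators just mentioned, here is a sketch in C. The macro names STR, XSTR, CAT and LIMIT are arbitrary names chosen for this example, not anything defined by the language or by GCC.

```c
#include <assert.h>
#include <string.h>

#define STR(x)    #x       /* # turns the argument's tokens into a string literal */
#define XSTR(x)   STR(x)   /* extra level so the argument is macro-expanded first */
#define CAT(a, b) a##b     /* ## pastes two tokens into a single token */
#define LIMIT     42       /* hypothetical macro, used only for this demo */

int CAT(coun, ter) = 7;    /* the pasted token is the identifier `counter` */

/* Small self-check; call it from main() to exercise the sketch. */
void check_operators(void)
{
    assert(strcmp(STR(LIMIT), "LIMIT") == 0);    /* operand of # is not expanded */
    assert(strcmp(XSTR(LIMIT), "42") == 0);      /* expanded, then stringified */
    assert(strcmp(STR(a  +  b), "a + b") == 0);  /* runs of whitespace become one space */
    assert(counter == 7);
}
```

The third assertion shows that # works on the token sequence rather than on the raw characters: the extra blanks around + in the source do not survive.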

    So, the conclusion is: cpp doesn't have to implement the whole C syntax, but it does have to recognise the tokens of the C language almost completely. The C standard requires the preprocessor to tokenize its input, so that the ## operator can merge tokens (and check the result for validity). This means that, if you define a macro like:

    #define M(p) +p
    

    and then you call it like:

    a = +M(-c);
    

    you will get a string similar to:

    a = + +-c;
    

    in the output. The preprocessor inserts a space between the two + signs so they do not get merged into a ++ operator, while + and - stay adjacent because +- can never be scanned as a single token. See the next example (input lines are preceded by the > symbol):

    $ cpp - <<EOF
    > #define M(p) +p
    > a = +M(p);
    > b = -M(p);
    > p = +M(+p);
    > p = +M(-p);
    > EOF
    # 1 "<stdin>"
    # 1 "<built-in>" 1
    # 1 "<built-in>" 3
    # 346 "<built-in>" 3
    # 1 "<command line>" 1
    # 1 "<built-in>" 2
    # 1 "<stdin>" 2
    
    a = + +p;
    b = -+p;
    p = + + +p;
    p = + +-p;
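The same effect can be observed from inside a C program. The sketch below reuses the macro M from the transcript; the function name plus_twice is invented for this example. Because expansion works on tokens, +M(p) is two unary plus operators applied to p, not the increment ++p that a purely textual substitution would produce.

```c
#include <assert.h>

#define M(p) +p

/* Returns +(+p). If the two + signs were merged into ++,
   p would be incremented and both checks below would fail. */
int plus_twice(int p)
{
    const int original = p;
    int r = +M(p);              /* tokens: + + p */
    assert(r == original);      /* value unchanged */
    assert(p == original);      /* p was not modified */
    return r;
}
```

Calling plus_twice(5) yields 5 and leaves the argument untouched.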
    

    Another example shows more of the difficulties in handling tokens (input lines are prefixed with >, stderr lines with >>, and stdout is unprefixed):

    $ cpp - <<EOF
    > #define M(a,b) a##b
    > a = M(a+,+b)
    > a = M(a+,-b)
    > a = M(a,+b)
    > a = M(a,b)
    > a = M(a,300)
    > a = M(a,300.2)
    > EOF
    >> <stdin>:3:5: error: pasting formed '+-', an invalid preprocessing token
    >> a = M(a+,-b)
    >>     ^
    >> <stdin>:1:17: note: expanded from macro 'M'
    >> #define M(a,b) a##b
    >>                 ^
    >> <stdin>:4:5: error: pasting formed 'a+', an invalid preprocessing token
    >> a = M(a,+b)
    >>     ^
    >> <stdin>:1:17: note: expanded from macro 'M'
    >> #define M(a,b) a##b
    >>                 ^
    >> <stdin>:7:5: error: pasting formed 'a300.2', an invalid preprocessing token
    >> a = M(a,300.2)
    >>     ^
    >> <stdin>:1:17: note: expanded from macro 'M'
    >> #define M(a,b) a##b
    >>                 ^
    >> 3 errors generated.
    # 1 "<stdin>"
    # 1 "<built-in>" 1
    # 1 "<built-in>" 3
    # 346 "<built-in>" 3
    # 1 "<command line>" 1
    # 1 "<built-in>" 2
    # 1 "<stdin>" 2
    
    a = a++b
    a = a+-b
    a = a+b
    a = ab
    a = a300
    a = a 300.2
    

    As you can see in this example, merging a and 300 works: the result, a300, is a valid identifier, so cpp(1) doesn't complain. Merging a and 300.2 would give a300.2, which is not a valid C token, so it is rejected. The two tokens are left unjoined and the tool inserts a space so that the compiler sees them as separate tokens; had it joined them, they would have been scanned as the tokens a300 and .2.
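A compilable counterpart of the valid cases may help. In this sketch (CAT and check_pasting are names chosen for the example) every paste produces a single valid token, so the compiler accepts it. Note that pasting 300 and .2 in that order is fine, because 300.2 is a valid number, whereas a followed by 300.2 is not, as the transcript shows.

```c
#include <assert.h>

#define CAT(a, b) a##b

int CAT(a, 300) = 1;                      /* identifier a300 */

/* Small self-check; call it from main() to exercise the sketch. */
void check_pasting(void)
{
    assert(a300 == 1);
    assert(CAT(3, 00) == 300);            /* 3 and 00 form the number 300 */
    assert(CAT(300, .2) == 300.2);        /* 300 and .2 form the number 300.2 */
    assert((1 CAT(<, <) 3) == 8);         /* < and < form the operator << */
}
```

An invalid paste such as CAT(a, 300.2) is deliberately left out, since it would stop the program from compiling.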

    If you want a language-independent macro preprocessor, consider using m4(1) as a macro language. It is far more powerful than cpp in many ways. But beware: it is difficult to learn because of the complexity of the macro expansions it allows.