My team at the university is writing a compiler in C. The compiler takes source code in a subset of Go (Golang) and outputs bytecode in an assembler-like language. My question is: which approach is more efficient, reading the source file character by character (getc) and changing the state of a single FSM depending on the current character, or reading in chunks (fgets) and calling auxiliary functions that contain FSMs to process individual lexemes and output tokens?
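For concreteness, here is a minimal sketch of the first approach as I imagine it (token kinds, states, and names are illustrative, not our real code); the second approach would instead read a chunk with fgets and run per-lexeme FSMs over the buffer:

    #include <ctype.h>
    #include <stdio.h>

    /* Sketch of approach 1: one FSM driven by getc().
       A token ending exactly at EOF is dropped; fine for a sketch. */
    enum state { S_START, S_IDENT, S_NUMBER };

    static void lex(FILE *in)
    {
        enum state st = S_START;
        int c;
        while ((c = getc(in)) != EOF) {
            switch (st) {
            case S_START:
                if (isalpha(c) || c == '_') st = S_IDENT;
                else if (isdigit(c))        st = S_NUMBER;
                /* whitespace and other characters stay in S_START */
                break;
            case S_IDENT:
                if (!isalnum(c) && c != '_') {
                    puts("IDENT");
                    ungetc(c, in);   /* re-examine the delimiter in S_START */
                    st = S_START;
                }
                break;
            case S_NUMBER:
                if (!isdigit(c)) {
                    puts("NUMBER");
                    ungetc(c, in);
                    st = S_START;
                }
                break;
            }
        }
    }

    int main(void)
    {
        lex(stdin);
        return 0;
    }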
What is more efficient in my case: reading by char (getc) or by chunks (fgets)?
Performance-wise, it is unlikely to matter much: the CPU is a lot faster than most disks (including SSDs), and the operating system's page cache hides most of the I/O cost either way.
Write your application, debug it, enable optimizations in your compiler (the one compiling your C code), then benchmark it. The answer is operating-system- and compiler-specific.
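A minimal sketch of such a benchmark, assuming a POSIX system (for clock_gettime) and using a placeholder file name:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    /* Time a char-by-char scan versus a line-by-line scan of the same
       file. "input.txt" is a placeholder; the checksums keep the
       compiler from optimizing the loops away. */
    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        char line[4096];
        long n = 0;
        int c;
        double t0, t1;
        FILE *f;

        f = fopen("input.txt", "r");
        if (!f) { perror("input.txt"); return 1; }
        t0 = now();
        while ((c = getc(f)) != EOF) n += c;
        t1 = now();
        printf("getc:  %.6f s (checksum %ld)\n", t1 - t0, n);
        fclose(f);

        f = fopen("input.txt", "r");
        if (!f) { perror("input.txt"); return 1; }
        n = 0;
        t0 = now();
        while (fgets(line, sizeof line, f)) n += line[0];
        t1 = now();
        printf("fgets: %.6f s (checksum %ld)\n", t1 - t0, n);
        fclose(f);
        return 0;
    }

Run it several times on a realistic input so the page cache is warm, and compare the numbers on your own machine.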
On Linux, read the documentation of GCC, GDB, gprof(1), time(1), syscalls(2), fopen(3), fflush(3), setvbuf(3), fgetc(3), fgets(3), getline(3), readline(3), time(7), and the book Advanced Linux Programming.
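For example, setvbuf(3) lets you give stdio a bigger buffer (so even fgetc reads mostly from userspace memory), and getline(3) manages line buffers for you. A minimal POSIX sketch:

    #define _POSIX_C_SOURCE 200809L   /* for getline(3) */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        static char buf[1 << 16];     /* 64 KiB stdio buffer */
        char *line = NULL;
        size_t cap = 0;
        ssize_t len;

        /* Enlarge stdin's buffer; must be done before the first read. */
        setvbuf(stdin, buf, _IOFBF, sizeof buf);

        /* getline() grows the buffer as needed, unlike fgets(). */
        while ((len = getline(&line, &cap, stdin)) != -1)
            printf("%zd chars\n", len);

        free(line);
        return 0;
    }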
If you happen to use GCC, study its source code for inspiration, read the documentation of its internals, and invoke it as gcc -O2 -ftime-report -ftime-report-details; you'll discover that most of the time spent by a compiler goes into optimizations, not parsing.
You should consider metaprogramming approaches: use (or develop) programs that generate C programs for your compiler project (inspired by ANTLR, GNU bison, iburg, SWIG, RPCGEN, etc.). Observe that GCC contains a dozen specialized domain-specific languages with corresponding code generators. Consider perhaps using GPP (or, as of November 2020, RefPerSys) for such purposes.
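In miniature, the idea looks like this: a toy generator (keyword list and function name invented for illustration) that writes a C keyword recognizer to stdout. Real tools like bison work from grammar files, but the principle is the same: write programs that write the repetitive C code for you.

    #include <stdio.h>

    /* Toy code generator: emits a C function recognizing these keywords. */
    static const char *keywords[] = { "func", "var", "if", "for", "return" };

    int main(void)
    {
        size_t i, n = sizeof keywords / sizeof keywords[0];

        puts("#include <string.h>");
        puts("int is_keyword(const char *s) {");
        for (i = 0; i < n; i++)
            printf("    if (strcmp(s, \"%s\") == 0) return 1;\n", keywords[i]);
        puts("    return 0;");
        puts("}");
        return 0;
    }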
Most of your compiler should transform internal compiler representations (e.g. simplified abstract syntax trees, perhaps inspired by GIMPLE), and you might consider using libgccjit if your teacher allows it.
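Concretely, such an internal representation is often a tagged union in C. A minimal illustrative sketch (node kinds invented for the example):

    #include <stdlib.h>

    /* Minimal AST sketch as a tagged union. */
    enum node_kind { N_INT_LIT, N_BINOP, N_IDENT };

    struct node {
        enum node_kind kind;
        union {
            long int_value;                      /* N_INT_LIT */
            struct { int op;                     /* N_BINOP: '+', '-', ... */
                     struct node *lhs, *rhs; } binop;
            const char *name;                    /* N_IDENT */
        } u;
    };

    struct node *new_binop(int op, struct node *lhs, struct node *rhs)
    {
        struct node *n = malloc(sizeof *n);
        if (!n) abort();
        n->kind = N_BINOP;
        n->u.binop.op = op;
        n->u.binop.lhs = lhs;
        n->u.binop.rhs = rhs;
        return n;
    }

Your optimization and code-generation passes then walk and rewrite trees of such nodes, which is where most of the interesting work happens.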
Of course, read the Dragon book (Compilers: Principles, Techniques, and Tools).