High-speed calculation of Coq's theorems

I have to wait until Coq finish its computations even in very simple cases.

I know about "Asynchronous and Parallel Proof Processing", but I suppose that my code has inherent vices, so I'd like to get some references or advices to guidelines/best practices of proofs' styling. e.g.:

try to use Definitions instead of Theorems,
Use compiler. Use parrallel processing. Use better hardware.
Do not use placeholders, fill every argument, like (@functionname var1 ... varn)
semicolons(;) instead of period(.)
It is much faster to use Definitions in Section instead of "set (f:=term)." in the proof. (possibly because every "set" get additional time to be printed. Even to Check Empty.)

How to accelerate Coq? (Please say if I have an error in items above. They are derived from my practice.)

What are the most critical stages of calculating and how to work with them?

Solution

The holy grail is to profile your code and optimize the hotspots. In Coq >= 8.6, you can Set Ltac Profiling. and Reset Ltac Profile. and Show Ltac Profile. to profile tactics. If you invoke coqc with the -time argument, you will get line-by-line timing information; a bit of sed trickery can sort the information for you:

coqc -time foo.v | sed s'/^\(.*\) \([0-9\.]\+ secs.*\)$/\2 \1/g' | sort -h

In Coq >= 8.8, you can Set NativeCompute Profiling to profile native computations (such as Eval native_compute in (slow program here)). This produces traces that you can visualize with perf report on GNU/Linux. See this bug report for more info.

If you hit other bottlenecks, you either have to be good at guessing, convince the devs to add more profiling tools, profile the running coq binary, or convince the devs that it's worth their time to profile your code for you. (I can sometimes get Pierre-Marie Pédrot to profile my code when the slowness points to an efficiency bug in Coq.)

One useful practice is to always profile your code on every single commit. In Coq >= 8.7, there are Makefile targets make-pretty-timed-before, make-pretty-timed-after, and print-pretty-timed-diff to get nice sorted tables of per-file compilation time diffs between two states of your repo. You can get per-line information with make TIMING=before, make TIMING=after, make all.timing.diff.

You may also be interested in looking at Experience Implementing a Performant Category-Theory Library in Coq (more media here), and perhaps also the slide deck for the presentation of that paper (pdf) (pptx with presenter notes).

Coq can be slow in a number of places, though most of the things you mention are irrelevant. Going through yours in order:

Theorem and Definition are synonyms, the only difference is that Definition supports := and Theorem doesn't.
There are no optimizing compilers for Coq, though better hardware, more RAM, faster CPUs, definitely help. As does parallel processing. On this note, file-level parallelism tends to work better than proof-level parallelism. I tend to split my files up as much as possible, have fine-grained imports so that the library loading time isn't an issue, and use make -j.
Filling every argument is more likely to hurt than to help. You incur additional unification costs, especially if the terms you are filling the arguments with are big. The real thing you do by filling in the arguments is trading evar creation against unification. Unification is usually slower. However, if you have any holes that are filled by canonical structure resolution, by typeclass resolution, or which require other backtracking and unfolding things or refreshing universe variables, filling them in can speed things up a lot.
I think the difference between semicolon and period in proof scripts only matters in interactive mode (in coqtop/CoqIDE/ProofGeneral/etc, not when running make). Let me know if you test it and discover otherwise.
This is very true and has impacts on both printing and other things. The set itself would not be slow due to printing, but instead due to the fact that it tries to find all occurrences of the term in your goal (stupidly, up to some reduction (beta-iota? I can't recall), rather than up to syntactic equality), and replace them with the new hypothesis. If you don't need this behavior, use pose rather than set. Additionally, big context variables can slow down tactics that depend on the size of the context, and this is why Definition is faster in sections.

Other thoughts on things I've run into:

Pick good abstraction barriers, and respect them religiously. You will pay in sweat and tears every single time you break an abstraction barrier. (Picking good abstraction barriers is incredibly hard. Usually I do this by copying existing mathematical abstractions or published papers. I've managed to generate a good abstraction barrier entirely from scratch, in some sense, exactly once in the past five years. The abstraction barrier, in hindsight, can be described with the "insight" that "when writing C-like code, every function takes one tuple as an argument and returns one tuple as a result.")
If you're doing reflective proofs, pick good algorithms. If you use unary natural numbers, or your algorithm has exponential running time, you're proof will be slow.
In Coq < 8.7, evar-map normalization incurs massive overhead on large terms. (Props to Pierre-Marie for fixing this with his EConstr branch.)
Slow End Section (and sometimes slow Defineds) are frequently caused by issues with rehashconsing. To fix this, cross your fingers and pray. (Or become a Coq dev and fix Coq.)
The guardedness checker is extremely slow at checking bare fixes with large bodies (maybe only in the presence of beta-iota-zeta reduxes?). The work-around is to extract the body into a separate definition. For example, rather than writing

Fixpoint length A (ls : list A) : nat :=
  match ls with
  | nil => 0
  | cons _ xs => S (length _ xs)
  end.

you could write

Definition length_step
             (length : forall A, list A -> nat)
             A (ls : list A)
    : nat
    := match ls with
       | nil => 0
       | cons _ xs => S (length _ xs)
       end.
Fixpoint length A (ls : list A) : nat
  := length_step length A ls.

Be aware that Coq will liberally (sometimes inconsistently) inline let x := ... in ... statements, which can easily lead to exponential definition internalization time.
When doing reification, canonical structures are fast but hard to read, Ltac is about 2x slower, and typeclass resolution can be 2x slower again (or can be the same speed as Ltac reification). Hopefully things will be better when Ltac2 is finished.
simpl is very slow on large terms

I may come back and add more later (and feel free to suggest things in the comments for me to add), but this is a half-decent start.