I understand the idea behind killer heuristic and why it helps. What I am struggling with is how to implement it in an Alpha-Beta search routine. In particular, how to ensure that only the killer moves of sibling nodes are tried first? Pseudo code will be of great help.
A killer move is likely to be a good refutation move at many parts of the search tree. So trying killer moves for other positions might be a good idea.
You may have different arrays of killer moves for each ply and clear the array for (ply+1) when entering (ply), this will give you a 'siblings-only' killers. I'd say that the '+1' offset is something like a parameter to tune: try clearing the array for (ply+N) instead and see where the performance is the best (I guess increasing N would improve things).
By the way, evaluation of engine's performance is a different rather resource-consuming task: usually an engine tries to solve a test suite, or two engines with different parameters are playing matches (the matches should be long enough, e.g. 100 games) between themselves to determine which is better.