Tuning Mathematical Parallel Codes

Assuming that I am interested in performance rather than portability of my linear algebra iterative multi-threaded solver and that I have the results of profiling my code in hand, how do I go about tuning my code to run optimally on that machine of my choice?

The algorithm involves Matrix-Vector multiplications, norms and dot-products. (FWIW, I am working on CG and GMRES).

I am working on codes which are of matrix size roughly equivalent to the full size of the RAM (~6GB). I'll be working on Intel i3 Laptop. I'll be linking my codes using Intel MKL.

Specifically,

Is there a good resource(PDF/Book/Paper) for learning manual tuning? There are numerous things that I learnt by doing for instance : Manual Unrolling isn't always optimal or about compiler flags but I would prefer a centralized resource.
I need something to translate profiler information to improved performance. For instance, my profiler tells me that my stacks of one processor are being accessed by another or that my mulpd ASM is taking too much time. I have no clue what these mean and how I could use this information for improving my code.

My intention is to spend as much time as needed to squeeze as much compute power as possible. Its more of a learning experience than for actual use or distribution as of now.

(I am concerned about manual tuning not auto-tuning)

Misc Details:

This differs from usual performance tuning since the major portions of the code are linked to Intel's proprietary MKL library.
Because of Memory Bandwidth issues in O(N^2) matrix-vector multiplications and dependencies, there is a limit to what I could manage on my own through simple observation.
I write in C and Fortran and I have tried both and as discussed a million times on SO, I found no difference in either if I tweak them appropriately.

Solution

Gosh, this still has no answers. After you've read this you'll still have no useful answers ...

You imply that you've already done all the obvious and generic things to make your codes fast. Specifically you have:

chosen the fastest algorithm for your problem (either that, or your problem is to optimise the implementation of an algorithm rather than to optimise the finding of a solution to a problem);
worked your compiler like a dog to squeeze out the last drop of execution speed;
linked in the best libraries you can find which are any use at all (and tested to ensure that they do in fact improve the performance of your program;
hand-crafted your memory access to optimise r/w performance;
done all the obvious little tricks that we all do (eg when comparing the norms of 2 vectors you don't need to take a square root to determine that one is 'larger' than another, ...);
hammered the parallel scalability of your program to within a gnat's whisker of the S==P line on your performance graphs;
always executed your program on the right size of job, for a given number of processors, to maximise some measure of performance;

and still you are not satisfied !

Now, unfortunately, you are close to the bleeding edge and the information you seek is not to be found easily in books or on web-sites. Not even here on SO. Part of the reason for this is that you are now engaged in optimising your code on your platform and you are in the best position to diagnose problems and to fix them. But these problems are likely to be very local indeed; you might conclude that no-one else outside your immediate research group would be interested in what you do, I know you wouldn't be interested in any of the micro-optimisations I do on my code on my platform.

The second reason is that you have stepped into an area that is still an active research front and the useful lessons (if any) are published in the academic literature. For that you need access to a good research library, if you don't have one nearby then both the ACM and IEEE-CS Digital Libraries are good places to start. (Post or comment if you don't know what these are.)

In your position I'd be looking at journals on 2 topics: peta- and exa-scale computing for science and engineering, and compiler developments. I trust that the former is obvious, the latter may be less obvious: but if your compiler already did all the (useful) cutting-edge optimisations you wouldn't be asking this question and compiler-writers are working hard so that your successors won't have to.

You're probably looking for optimisations which like, say, loop unrolling, were relatively difficult to find implemented in compilers 25 years ago and which were therefore bleeding-edge back then, and which themselves will be old and established in another 25 years.

EDIT

First, let me make explicit something that was originally only implicit in my 'answer': I am not prepared to spend long enough on SO to guide you through even a summary of the knowledge I have gained in 25+ years in scientific/engineering and high-performance computing. I am not given to writing books, but many are and Amazon will help you find them. This answer was way longer than most I care to post before I added this bit.

Now, to pick up on the points in your comment:

on 'hand-crafted memory access' start at the Wikipedia article on 'loop tiling' (see, you can't even rely on me to paste the URL here) and read out from there; you should be able to quickly pick up the terms you can use in further searches.
on 'working your compiler like a dog' I do indeed mean becoming familiar with its documentation and gaining a detailed understanding of the intentions and realities of the various options; ultimately you will have to do a lot of testing of compiler options to determine which are 'best' for your code on your platform(s).
on 'micro-optimisations', well here's a start: Performance Optimization of Numerically Intensive Codes. Don't run away with the idea that you will learn all (or even much) of what you want to learn from this book. It's now about 10 years old. The take away messages are:
- performance optimisation requires intimacy with machine architecture;
- performance optimisation is made up of 1001 individual steps and it's generally impossible to predict which ones will be most useful (and which ones actually harmful) without detailed understanding of a program and its run-time environment;
- performance optimisation is a participation sport, you can't learn it without doing it;
- performance optimisation requires obsessive attention to detail and good record-keeping.

Oh, and never write a clever piece of optimisation that you can't easily un-write when the next compiler release implements a better approach. I spend a fair amount of time removing clever tricks from 20-year old Fortran that was justified (if at all) on the grounds of boosting execution performance but which now just confuses the programmer (it annoys the hell out of me too) and gets in the way of the compiler doing its job.

Finally, one piece of wisdom I am prepared to share: these days I do very little optimisation that is not under one of the items in my first list above; I find that the cost/benefit ratio of micro-optimisations is unfavourable to my employers.