I recently started learning parallel programming techniques and what to pay attention to when trying to write efficient programs. For example, knowing specific details about your processor's caches is essential if you want to write fast code.
I want to know which cache feature is more important (if either one is): the block (line) size, or the associativity, i.e. the number of ways per set, e.g. 4-way vs. 8-way set-associative.
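To pin down the terms: "4-way" or "8-way" is the associativity (lines per set), not the number of sets. For a fixed capacity, the number of sets follows from the other two parameters: sets = capacity / (line_size * ways). The sketch below decomposes an address into tag, set index, and byte offset. The 32 KiB capacity and 64-byte line used in the example are assumed values for illustration only, and all sizes are assumed to be powers of two.

```python
def cache_geometry(capacity, line_size, ways):
    """Return (num_sets, offset_bits, index_bits) for a set-associative cache.

    Assumes capacity, line_size and ways are powers of two.
    """
    num_sets = capacity // (line_size * ways)
    offset_bits = line_size.bit_length() - 1   # log2(line_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    return num_sets, offset_bits, index_bits


def split_address(addr, capacity, line_size, ways):
    """Split a byte address into (tag, set_index, byte_offset)."""
    num_sets, offset_bits, index_bits = cache_geometry(capacity, line_size, ways)
    offset = addr & (line_size - 1)                  # byte within the line
    index = (addr >> offset_bits) & (num_sets - 1)   # which set the line maps to
    tag = addr >> (offset_bits + index_bits)         # identifies the line within its set
    return tag, index, offset


# Hypothetical 32 KiB cache with 64-byte lines: going from 8-way to 4-way
# doubles the number of sets, while the line size stays the same.
print(cache_geometry(32 * 1024, 64, 8))
print(cache_geometry(32 * 1024, 64, 4))
print(split_address(0x12345, 32 * 1024, 64, 8))
```

Halving the associativity at a fixed capacity doubles the number of sets; halving the line size does the same. So "number of sets" alone does not tell you which of the two parameters changed.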
Associativity matters more than line size. Many accesses in HPC are sequential, so a smaller line size mostly just spends more storage on tags (one tag per line) for little benefit.
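A rough sketch of the tag-overhead argument, under assumed parameters: 48-bit addresses and 2 state bits (valid + dirty) per line are hypothetical values, and real caches store extra metadata (e.g. replacement and coherence state) that this ignores. The function estimates how many extra bits the cache stores per data bit.

```python
def tag_overhead(capacity, line_size, ways, addr_bits=48, state_bits=2):
    """Fraction of extra bits (tag + per-line state) relative to data bits.

    Hypothetical model: addr_bits-wide addresses and state_bits of state per
    line; capacity, line_size and ways are assumed to be powers of two.
    """
    num_sets = capacity // (line_size * ways)
    offset_bits = line_size.bit_length() - 1   # log2(line_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    tag_bits = addr_bits - index_bits - offset_bits
    num_lines = capacity // line_size          # one tag per line
    return num_lines * (tag_bits + state_bits) / (capacity * 8)


# Compare line sizes for an assumed 32 KiB, 8-way cache.
for line in (32, 64, 128):
    print(line, tag_overhead(32 * 1024, line, 8))
```

For a fixed capacity and associativity, index_bits + offset_bits = log2(capacity / ways) is constant, so the tag width does not change with line size. Halving the line size therefore doubles the number of tags and roughly doubles the relative overhead.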
Having more, smaller sets (because of a smaller line size, at the same capacity and associativity) might help for a histogram problem, which is one of the main access patterns that can't easily be rearranged into sequential accesses.
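The trade-off can be illustrated with a toy set-associative cache using LRU replacement. This is a hypothetical model that tracks tags only and ignores prefetching, bandwidth, and multiple cache levels. The "hot counter" pattern is contrived: 100 four-byte histogram bins, 96 bytes apart, incremented round-robin. It stands in for a histogram whose frequently hit bins are scattered in memory. The sequential scan is the contrasting case from the previous paragraph.

```python
from collections import OrderedDict


class SetAssocCache:
    """Toy set-associative cache with LRU replacement (tags only, no data)."""

    def __init__(self, capacity, line_size, ways):
        assert capacity % (line_size * ways) == 0
        self.line_size = line_size
        self.ways = ways
        self.num_sets = capacity // (line_size * ways)
        # One OrderedDict per set; key order runs from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        s = self.sets[line % self.num_sets]   # set index
        tag = line // self.num_sets
        if tag in s:
            s.move_to_end(tag)                # hit: mark most recently used
            return True
        self.misses += 1
        if len(s) == self.ways:
            s.popitem(last=False)             # evict the least recently used line
        s[tag] = None
        return False


def hot_counter_misses(line_size, num_counters=100, stride=96, rounds=10,
                       capacity=4096, ways=4):
    """Histogram-like pattern: scattered 4-byte bins updated round-robin."""
    cache = SetAssocCache(capacity, line_size, ways)
    for _ in range(rounds):
        for i in range(num_counters):
            cache.access(i * stride)
    return cache.misses


def sequential_misses(line_size, nbytes=64 * 1024, capacity=4096, ways=4):
    """Streaming pattern: read nbytes once, 4 bytes at a time."""
    cache = SetAssocCache(capacity, line_size, ways)
    for addr in range(0, nbytes, 4):
        cache.access(addr)
    return cache.misses


for line in (32, 64):
    print(line, hot_counter_misses(line), sequential_misses(line))
```

In this contrived setup, 32-byte lines give the same 4 KiB cache twice as many sets. All 100 hot bins then fit after the first round, while with 64-byte lines every set holds more hot lines than it has ways, so each access misses. The sequential scan shows the opposite: it takes twice as many misses with 32-byte lines, because each line brings in half as much useful data.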
Of course, latency and bandwidth usually matter even more than 4-way vs. 8-way associativity.