I work for a simulation software vendor. We are now starting to implement distributed computing with MPI in our software, and I don't really understand how we should distribute our MPI-capable product.
MPI is an interface specification, so the actual MPI implementation should be replaceable, right? Whoever runs the cluster can provide a very specialized MPI implementation for the hardware/communication layer they use. This makes sense to me.
On the other hand, when I run

    ldd mympiapp

I see

    libmpi.so.12 => /home/mpiuser/mpich-3.2-install/lib/libmpi.so.12 (0x00007fae34684000)
It seems that after building, my application is linked against the specific MPI version I built with. We already ship our application in different versions for different OSes. Should we now also add combinations for different MPI implementations? Or should we also distribute the MPI shared libraries together with our application? What do users and cluster providers expect from us?
I have read a lot of web resources, but most of what I find is written from the standpoint that whoever compiles the software also runs it.
There's a reason MPI implementations come with mpicc.
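mpicc is a wrapper around the system compiler that adds the include paths and link flags for the implementation it belongs to. To make the consequence concrete, here is a minimal sketch (the file name and build command are only illustrative): the source uses nothing but the standard MPI API, so it recompiles unchanged against MPICH, Open MPI, Intel MPI, and so on, but the binary produced by a particular mpicc ends up linked against that implementation's libmpi, which is exactly what the ldd output in the question shows.

    /* hello_mpi.c -- illustrative sketch. Build with the implementation's
     * wrapper compiler, e.g.:  mpicc -O2 -o hello_mpi hello_mpi.c
     * The source only uses the standard MPI API, so it recompiles against
     * any conforming implementation; the resulting binary, however, is
     * linked against whichever libmpi that mpicc belongs to. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Running ldd on a binary built this way against a different MPI installation would show a different libmpi path, even though the source never changed. Portability lives at the source level, not at the binary level.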
High-performance software differs from ordinary software in that performance is absolutely critical. Compiling a single binary for distribution is generally not acceptable, because hardware abstractions are leaky where performance is concerned: a binary that runs correctly everywhere will not run fast everywhere.
Vendors of large-scale high-performance software typically distribute it in one of a few ways: as a collection of binaries built for various hardware/software combinations; by sending an engineer (or several) on-site to compile and tune the software for the customer's system; or, in the case of some smaller companies I have heard of, by giving the source code to the customer under very strict contracts.
Three reasons why it needs to be compiled specifically for the customer's system:
So that the correct MPI and OpenMP implementations for the hardware are used,
So that a platform-specific compiler can be used to generate the most efficient instructions possible,
So that compile-time algorithm parameters can be tuned for the hardware (processors, memory, and interconnect): the communication pattern your code uses should depend on the interconnect, block sizes should depend on processor cache size, and so on. A sketch of this follows the list.
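To make the third point concrete, here is a minimal sketch of a compile-time tuning parameter. The macro name BLOCK_SIZE, its default value, and the build command are hypothetical; the point is only that the value is baked into the binary when it is compiled for a specific machine (for example, chosen to match that machine's cache), so a different machine calls for a different build.

    /* transpose_blocked.c -- illustrative sketch of a compile-time tuning knob.
     * BLOCK_SIZE is hypothetical; it would typically be chosen per target
     * system at build time, e.g.:  mpicc -O3 -DBLOCK_SIZE=32 -c transpose_blocked.c */
    #include <stddef.h>

    #ifndef BLOCK_SIZE
    #define BLOCK_SIZE 32   /* fallback; overridden to match the target's cache */
    #endif

    /* Blocked transpose of an n-by-n row-major matrix: working on
     * BLOCK_SIZE x BLOCK_SIZE tiles keeps both the source rows and the
     * destination columns in cache. The best tile size depends on the
     * processor's cache, which is why it is fixed when compiling for a
     * specific system rather than chosen at run time. */
    void transpose_blocked(const double *a, double *b, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK_SIZE)
            for (size_t jj = 0; jj < n; jj += BLOCK_SIZE)
                for (size_t i = ii; i < ii + BLOCK_SIZE && i < n; ++i)
                    for (size_t j = jj; j < jj + BLOCK_SIZE && j < n; ++j)
                        b[j * n + i] = a[i * n + j];
    }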
This need to couple the hardware and the compiled binary generally results in long sales cycles for commercial MPI software.