Tags: linux, performance, parallel-processing, linux-kernel, parallelism-amdahl

Parallel computing: how to share computing resources among users?


I am running a simulation on a Linux machine with the following specs.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:              4
CPU MHz:               3099.902
CPU max MHz:           3700.0000
CPU min MHz:           1000.0000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              28160K

This is the command-line script used to run my solver.

/path/to/meshfree/installation/folder/meshfree_run.sh    # on 1 (serial) worker
/path/to/meshfree/installation/folder/meshfree_run.sh N  # on N parallel MPI processes

I share the system with a colleague, who uses 10 cores for his solution. What would be the fastest option for me in this case? Using 30 MPI processes?

I am a Mechanical Engineer with very little knowledge of parallel computing, so please excuse me if the question is too stupid.


Solution

  • Q : "What would be the fastest option for me in this case? ...running short on time. I am already in the middle of a simulation."

    Salutes to Aachen. Were it not for your ex-post remark that you are already in the middle of a simulation, the fastest option would have been to pre-configure the computing eco-system so that you:

    • check the full details of your NUMA device - using lstopo, or lstopo-no-graphics -.ascii, not just lscpu
    • initiate your jobs with as many MPI-worker processes as possible mapped onto physical CPU-cores (best each one exclusively pinned onto its own private core), as these workers carry the core FEM / meshing processing workload
    • if your FH policy does not forbid doing so, ask the system administrator to introduce CPU-affinity mapping (which protects your in-cache data from eviction and expensive re-fetches): for example, 10 CPUs mapped exclusively for use by your colleague, the said 30 CPUs mapped exclusively for your application runs, and the remaining ~40 CPUs left "shared"-for-use by both, via your respective CPU-affinity masks - a sketch of such an inspection and pinning follows right after this list
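
    A minimal sketch of what that topology inspection and pinning could look like, assuming a Linux shell with the hwloc tools installed and an Open MPI-style mpirun somewhere underneath meshfree_run.sh - the CPU ranges, the solver binary name and the mpirun flags are illustrative assumptions, not meshfree specifics:

        # inspect the real NUMA / core / cache topology (richer than lscpu)
        lstopo-no-graphics -.ascii
        lscpu -e                          # per-CPU table: CPU, CORE, SOCKET, NODE columns

        # hypothetical affinity split: colleague on cores 0-9, you on cores 10-39,
        # the remaining CPUs left shared; the mask is inherited by the child MPI ranks
        taskset -c 10-39 /path/to/meshfree/installation/folder/meshfree_run.sh 30

        # if an Open MPI mpirun is reachable directly, pinning each rank onto one
        # physical core would look roughly like this (solver_binary is a placeholder)
        mpirun -np 30 --bind-to core --map-by core ./solver_binary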

    Q : "Using 30 MPI processes?"

    No, this is not the fastest choice for ASAP processing - use as many CPUs for the workers as possible for an already MPI-parallelised FEM-simulation (such simulations have a high degree of parallelism and most often a by-nature "narrow" locality, be it represented as a sparse-matrix / N-band-matrix solver, so the parallel portion is often very high compared to other numerical problems) - Amdahl's Law explains why, as quantified just below.
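
    For reference, Amdahl's Law puts the achievable speedup of a job with parallelisable fraction p, run on N workers, at

        S(N) = 1 / ( (1 - p) + p / N )

    so with a purely illustrative p = 0.95 (an assumption, not a measured property of the meshfree solver), S(30) ≈ 12.2, S(40) ≈ 13.6 and S(70) ≈ 15.7 - the incremental gain per extra worker shrinks, yet more workers still finish the same job sooner.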

    Sure, there might be some academic objections that a slight difference is possible in cases where the communication overheads get slightly reduced with one less worker, yet the need for brute-force processing rules in FEM / meshed solvers (communication costs are typically way less expensive than the large-scale, FEM-segmented numerical computing part, as only a small amount of the neighbouring blocks' "boundary"-node state data gets sent).

    htop will show you the actual state (you may notice the process-to-CPU-core mapping wandering around, due to HT / CPU-core thermal-balancing tricks, which decreases the resulting performance); a few commands for checking the actual placement are sketched below.

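    A small sketch of how to verify where the ranks actually run - the process name meshfree used in the grep is an assumption, substitute whatever your solver's processes are actually called:

        htop                                   # F2 -> Columns -> add PROCESSOR to watch core IDs live
        ps -eLo pid,psr,comm | grep meshfree   # PSR = CPU each thread last ran on
        taskset -cp <PID>                      # current affinity mask of one rank (<PID> is a placeholder)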

    And do consult the meshfree Support for their Knowledge Base sources on Best Practices.


    Next time, the best option would be to acquire a less restrictive computing infrastructure for processing critical workloads (given business-critical conditions, consider restricted shared resources a risk to smooth BAU, the more so if they impact your business continuity).