In an astrophysics context, I launched a large simulation (Enzo code) with MPI on 128 cores, like this:
mpirun -np 128 ./enzo.exe amr_cosmology.enzo
During the run I get the following errors. They are marked as a Hardware Error, so I conclude that a stick of the total RAM (1GB) is bad. As you can see, the code doesn't stop, but these error messages occur often throughout the run:
TopGrid dt = 3.705042e-02 time = 1.2350099725762 cycle = 14 z = 834.55610989934
TopGrid dt = 3.816191e-02 time = 1.272060395839 cycle = 15 z = 818.25224654732
TopGrid dt = 3.930675e-02 time = 1.3102223091899 cycle = 16 z = 802.26651295398
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711318] [Hardware Error]: Corrected error, no action required.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711377] [Hardware Error]: CPU:2 (17:31:0) MC17_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2041000000011b
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711387] [Hardware Error]: Error Addr: 0x0000001c9f3d4ac0
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711388] [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x0f5940000a801001
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711399] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711407] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711422] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711474] [Hardware Error]: Corrected error, no action required.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711479] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711483] [Hardware Error]: Error Addr: 0x0000001ee2f9b140
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711484] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xda9020000a800d01
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711489] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711492] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Message from syslogd@pablo at Sep 24 20:52:00 ...
kernel:[2415943.711497] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
TopGrid dt = 4.048593e-02 time = 1.3495290567141 cycle = 17 z = 786.59270291163
TopGrid dt = 4.170048e-02 time = 1.3900149827028 cycle = 18 z = 771.22472945212
TopGrid dt = 4.295147e-02 time = 1.4317154617942 cycle = 19 z = 756.15662471201
What kind of error is this: is it automatically corrected, or is it indeed a hardware failure? Either way, something is wrong.
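This is due to faulty RAM. Each event in your log was corrected by ECC ("Corrected error, no action required"), so the data was repaired on the fly, but ECC corrections as frequent as in your case indicate faulty hardware. The fix is to identify the memory module that causes the errors and replace it. If this is not a critical system, you might not need to fix it immediately.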
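One way to narrow down which module is producing the errors is to read the per-DIMM corrected-error counters that the Linux EDAC subsystem exposes in sysfs. The sketch below assumes an EDAC driver is loaded for your memory controller and that the `mc*/dimm*/dimm_ce_count` and `dimm_label` files are present under `/sys/devices/system/edac/mc`. The function name `read_ce_counts` is only illustrative, and labels may be empty or generic on some boards.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {dimm name: corrected error count} for each DIMM EDAC exposes.

    Assumes the per-DIMM sysfs layout (mcN/dimmM/dimm_ce_count); entries
    without that file are skipped.
    """
    counts = {}
    for dimm in sorted(Path(edac_root).glob("mc*/dimm*")):
        ce_file = dimm / "dimm_ce_count"
        if not ce_file.is_file():
            continue
        label_file = dimm / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        # Key combines controller and DIMM index, plus the label if one is set.
        key = f"{dimm.parent.name}/{dimm.name}" + (f" ({label})" if label else "")
        counts[key] = int(ce_file.read_text().strip())
    return counts


if __name__ == "__main__":
    # Print DIMMs with the most corrected errors first.
    for name, ce in sorted(read_ce_counts().items(), key=lambda kv: -kv[1]):
        print(f"{ce:10d}  {name}")
```

A DIMM whose counter keeps climbing during the run is the likely candidate for replacement. Mapping the reported label to a physical slot still depends on the motherboard documentation.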
In some instances, RAM that is not running at its expected frequency can also cause this issue.
See the references for more information. Ref 1, Ref 2, Ref 3
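On the "corrected or real failure?" part of the question, the two status words in the log can be decoded by hand. The sketch below is a minimal illustration: the bit positions are my assumption, based on the machine-check status layout as I understand it from the Linux kernel sources, and `STATUS_BITS` and `decode_status` are hypothetical names. The result agrees with the flags the kernel itself printed between the square brackets.

```python
# Assumed bit positions of selected flags in an MCA status register.
STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten before it was read
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # address register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid
    "CECC": 46,   # corrected ECC error
    "UECC": 45,   # uncorrected ECC error
}


def decode_status(status):
    """Return the names of the flags set in a 64-bit status value."""
    return [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]


# The two status values from the log above.
for raw in (0x9C2041000000011B, 0xDC2041000000011B):
    print(hex(raw), decode_status(raw))
```

Both values have `CECC` set and `UC`, `UECC` and `PCC` clear, which matches "Corrected error, no action required". The second value also has `Over` set, which suggests errors were arriving faster than they were being logged.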