I understood that the --error-exitcode option can be used to cause valgrind to return a non-zero exit code when leaks are found. The valgrind documentation says:
--error-exitcode=<number> [default: 0] Specifies an alternative exit code to return if Valgrind reported any errors in the run. When set to the default value (zero), the return value from Valgrind will always be the return value of the process being simulated. When set to a nonzero value, that value is returned instead, if Valgrind detects any errors. This is useful for using Valgrind as part of an automated test suite, since it makes it easy to detect test cases for which Valgrind has reported errors, just by inspecting return codes. When set to a nonzero value and Valgrind detects no error, the return value of Valgrind will be the return value of the program being simulated.
But when I run valgrind on one of my test programs with this option, the output clearly shows a leak, yet valgrind returns 0. What am I doing wrong?
ed@mikado:~/NCEPLIBS-g2/b/tests$ valgrind --error-exitcode=1 --leak-check=full --show-leak-kinds=all ./test_getgb2_mem_4 && echo $?
==72557== Memcheck, a memory error detector
==72557== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==72557== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==72557== Command: ./test_getgb2_mem_4
==72557==
Opening GRIB2 file data/gep19.t00z.pgrb2a.0p50_bcf144
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
Opening GRIB2 file data/geavg.t00z.pgrb2a.0p50_mecomf144
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
Opening GRIB2 file data/gec00.t00z.pgrb2a.0p50.f144
REC PD5 PD6 PD7 YEAR MN DY HR F1 F2 FU E2 E3 E4 LEN MAX MIN Sample
56 7 100 1000 23 4 30 0 144 0 0 1 2 1 259920 364.85 -458.59 165.82
47 7 100 925 23 4 30 0 144 0 0 1 2 1 259920 956.03 172.94 766.25
41 7 100 850 23 4 30 0 144 0 0 1 2 1 259920 1644.57 855.80 1413.74
36 7 100 700 23 4 30 0 144 0 0 1 2 1 259920 3223.11 2388.85 2870.44
31 7 100 500 23 4 30 0 144 0 0 1 2 1 259920 5910.78 4752.64 5280.56
Opening GRIB2 file data/gegfs.t00z.pgrb2a.0p50.f144
REC PD5 PD6 PD7 YEAR MN DY HR F1 F2 FU E2 E3 E4 LEN MAX MIN Sample
57 7 100 1000 23 4 30 0 144 0 10 1 1 1 259920 518.26 -490.32 161.32
48 7 100 925 23 4 30 0 144 0 10 1 1 1 259920 1021.39 147.21 762.75
42 7 100 850 23 4 30 0 144 0 10 1 1 1 259920 1692.23 830.58 1411.62
37 7 100 700 23 4 30 0 144 0 10 1 1 1 259920 3237.82 2361.94 2871.40
32 7 100 500 23 4 30 0 144 0 10 1 1 1 259920 5921.03 4759.57 5286.23
Opening GRIB2 file data/gegfs.t00z.pgrb2a.0p50_mef144
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
getgb2 returned 99
Opening GRIB2 file data/gep19.t00z.pgrb2a.0p50.f144
REC PD5 PD6 PD7 YEAR MN DY HR F1 F2 FU E2 E3 E4 LEN MAX MIN Sample
56 7 100 1000 23 4 30 0 144 0 0 3 19 1 259920 306.98 -454.66 93.58
47 7 100 925 23 4 30 0 144 0 0 3 19 1 259920 952.33 163.82 695.26
41 7 100 850 23 4 30 0 144 0 0 3 19 1 259920 1640.95 825.74 1343.70
36 7 100 700 23 4 30 0 144 0 0 3 19 1 259920 3214.48 2306.39 2803.36
31 7 100 500 23 4 30 0 144 0 0 3 19 1 259920 5913.73 4742.57 5224.79
==72557==
==72557== HEAP SUMMARY:
==72557== in use at exit: 300,000 bytes in 6 blocks
==72557== total heap usage: 4,268 allocs, 4,262 frees, 59,833,667 bytes allocated
==72557==
==72557== 300,000 bytes in 6 blocks are still reachable in loss record 1 of 1
==72557== at 0x4E050B5: malloc (vg_replace_malloc.c:431)
==72557== by 0x12E71A: getg2ir_ (getg2ir.f:55)
==72557== by 0x118150: getidx_ (getidx.F90:117)
==72557== by 0x115B4A: getgb2_ (getgb2.f:107)
==72557== by 0x10A950: gb2read_ (test_getgb2_mem.F90:165)
==72557== by 0x10DA04: MAIN__ (test_getgb2_mem.F90:43)
==72557== by 0x10DA86: main (test_getgb2_mem.F90:50)
==72557==
==72557== LEAK SUMMARY:
==72557== definitely lost: 0 bytes in 0 blocks
==72557== indirectly lost: 0 bytes in 0 blocks
==72557== possibly lost: 0 bytes in 0 blocks
==72557== still reachable: 300,000 bytes in 6 blocks
==72557== suppressed: 0 bytes in 0 blocks
==72557==
==72557== For lists of detected and suppressed errors, rerun with: -s
==72557== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
0
ed@mikado:~/NCEPLIBS-g2/b/tests$
Is it because valgrind does not regard still-reachable memory as an error?
If so, how do I detect such leaks with valgrind and get a non-zero exit code, so that I can use my CI to run the tests?
Yes, by default, memory that is still reachable when the process terminates is not an error ("ERROR SUMMARY: 0 errors"). As the FAQ indicates, if you have memory that is still reachable...
your program is probably ok -- it didn't free some memory it could have. This is quite common and often reasonable.
Since at the end of the process, the operating system will reclaim all memory anyway, there's not really a big practical advantage to freeing memory yourself just before the program ends, though one could argue it is neater.
If you want Valgrind to treat reachable memory as an error, you can pass --errors-for-leak-kinds=all in addition to your current flags:
$ valgrind --leak-check=full --show-leak-kinds=all --errors-for-leak-kinds=all --error-exitcode=1 ./example
==103420== Memcheck, a memory error detector
==103420== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==103420== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==103420== Command: ./example
==103420==
==103420==
==103420== HEAP SUMMARY:
==103420== in use at exit: 123 bytes in 1 blocks
==103420== total heap usage: 1 allocs, 0 frees, 123 bytes allocated
==103420==
==103420== 123 bytes in 1 blocks are still reachable in loss record 1 of 1
==103420== at 0x484182F: malloc (vg_replace_malloc.c:431)
==103420== by 0x401133: main (in /home/foo/example)
==103420==
==103420== LEAK SUMMARY:
==103420== definitely lost: 0 bytes in 0 blocks
==103420== indirectly lost: 0 bytes in 0 blocks
==103420== possibly lost: 0 bytes in 0 blocks
==103420== still reachable: 123 bytes in 1 blocks
==103420== suppressed: 0 bytes in 0 blocks
==103420==
==103420== For lists of detected and suppressed errors, rerun with: -s
==103420== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
$ echo $?
1
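For CI use, the flags above could be wrapped in a small shell helper. This is only a sketch: the function name run_memcheck is made up for illustration, and it assumes valgrind is on the PATH. The helper passes through whatever exit status Valgrind returns, so it returns 1 when Valgrind reports any error (including still-reachable blocks) and the test program's own exit status otherwise.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration): run one test binary
# under Memcheck with every leak kind, including "still reachable",
# counted as an error.
# Returns 1 if Valgrind reports any error; otherwise returns the exit
# status of the program being run.
run_memcheck() {
    valgrind --leak-check=full --show-leak-kinds=all \
             --errors-for-leak-kinds=all --error-exitcode=1 "$@"
}
```

A call might look like `run_memcheck ./test_getgb2_mem_4; echo $?`. Note the `;` rather than `&&` before `echo`: with `&&`, the echo only runs when the preceding command returns 0, so a non-zero exit code would never be printed.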