C: strategies for debugging obscure memory leaks?

I'm working on a project in C, and I'm trying to debug an obscure bug that crashes my program. It's fairly large, and my attempts to isolate the problem by making smaller versions of the code haven't worked. So I'm trying to come up with a way to debug and pinpoint a memory leak.

I came up with the following plan: I know the problem comes from running a certain function, and that function calls itself recursively. So I thought I could take a snapshot of sorts of my program's memory allocations. Since I don't know much about what happens under the hood (I know a little, but not enough to be useful in this situation), I keep the records myself:

typedef struct record_mem {
    int num_allocs;
    int num_frees;
    int size_space;
    int num_structure_1;
    ...
    int num_structure_N;
    int num_records;
    struct record_mem *next;
} RECORD;
extern RECORD *top;
void pushmem(RECORD **top)
{
    /* Allocate the new record (needs <stdlib.h>); bail out if malloc fails. */
    RECORD *nnew = malloc(sizeof(RECORD));
    if (!nnew)
        return;
    nnew->num_allocs=1;
    nnew->num_frees=0;
    nnew->size_space=sizeof(RECORD);
    nnew->num_structure_1=0;
    ...
    nnew->num_structure_N=0;
    nnew->num_records=1;
    nnew->next=0;
    if(*top)
    {
        nnew->num_allocs+=(*top)->num_allocs;
        nnew->num_frees=(*top)->num_frees;
        nnew->size_space+=(*top)->size_space;
        nnew->num_structure_1=(*top)->num_structure_1;
        ...
        nnew->num_structure_N=(*top)->num_structure_N;
        nnew->num_records+=(*top)->num_records;
        nnew->next=*top;
    }
    *top=nnew;
}

The idea is that I print out the contents of my memory records right before the moment my program crashes (I know where it crashes thanks to GDB).

Then, throughout the program (for each data structure in my program I have a push function similar to the one above), I can add a one-liner that calls a function tallying data structure allocations plus total stack (heap?) memory allocations (those I can keep track of). I simply create more memory_record structures wherever I feel the need to record a snapshot of my running program. The problem is that this memory balance sheet won't help if I can't somehow record how much memory is actually being used.

But how would I do this? Plus how would I take dangling pointers and leaks into account? I'm using OS X and I'm currently looking up how I could record the stack pointer and other things.

EDIT: Since you asked, here is the output of valgrind. (closure() is the function called from main that returns the bad pointer; it's supposed to return the head of a doubly linked list. traversehashmap() is a function called from closure() that I use to calculate and append extra nodes to the linked list; it calls itself recursively because it needs to jump around between nodes.)

jason-danckss-macbook:project Jason$ valgrind --leak-check=full --tool=memcheck ./testc
Will attempt to compute closure of AB:
Result: testcl: 0x10000d0b0
==7682== Invalid read of size 8
==7682==    at 0x100001D4E: printrelation2 (relation.h:490)
==7682==    by 0x100003CFE: main (test-computation.c:47)
==7682==  Address 0x10000cee8 is 8 bytes inside a block of size 24 free'd
==7682==    at 0xD828: free (vg_replace_malloc.c:450)
==7682==    by 0x100001232: destroyrelation2 (relation.h:161)
==7682==    by 0x100003407: destroyallhashmap (computation.h:333)
==7682==    by 0x1000039E1: closure (computation.h:539)
==7682==    by 0x100003CBE: main (test-computation.c:38)
==7682== 
==7682== 
==7682== HEAP SUMMARY:
==7682==     in use at exit: 5,360 bytes in 48 blocks
==7682==   total heap usage: 99 allocs, 51 frees, 6,640 bytes allocated
==7682== 
==7682== 48 (24 direct, 24 indirect) bytes in 1 blocks are definitely lost in loss record 33 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100001104: getnewrelation (relation.h:134)
==7682==    by 0x100001848: copyrelation (relation.h:343)
==7682==    by 0x100003991: closure (computation.h:531)
==7682==    by 0x100003CBE: main (test-computation.c:38)
==7682== 
==7682== 1,128 (24 direct, 1,104 indirect) bytes in 1 blocks are definitely lost in loss record 36 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100002315: getnewholder (dependency.h:129)
==7682==    by 0x100003B17: main (test-computation.c:14)
==7682== 
==7682== LEAK SUMMARY:
==7682==    definitely lost: 48 bytes in 2 blocks
==7682==    indirectly lost: 1,128 bytes in 44 blocks
==7682==      possibly lost: 0 bytes in 0 blocks
==7682==    still reachable: 4,096 bytes in 1 blocks
==7682==         suppressed: 88 bytes in 1 blocks
==7682== Reachable blocks (those to which a pointer was found) are not shown.
==7682== To see them, rerun with: --leak-check=full --show-reachable=yes
==7682== 
==7682== For counts of detected and suppressed errors, rerun with: -v
==7682== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

Solution

From your valgrind output:

This is likely what causes your problem:

==7682== Invalid read of size 8
==7682==    at 0x100001D4E: printrelation2 (relation.h:490)
==7682==    by 0x100003CFE: main (test-computation.c:47)
==7682==  Address 0x10000cee8 is 8 bytes inside a block of size 24 free'd
==7682==    at 0xD828: free (vg_replace_malloc.c:450)
==7682==    by 0x100001232: destroyrelation2 (relation.h:161)
==7682==    by 0x100003407: destroyallhashmap (computation.h:333)
==7682==    by 0x1000039E1: closure (computation.h:539)
==7682==    by 0x100003CBE: main (test-computation.c:38)

Let's go in depth

==7682== Invalid read of size 8
==7682==    at 0x100001D4E: printrelation2 (relation.h:490)
==7682==    by 0x100003CFE: main (test-computation.c:47)

This is the summary of your error: printrelation2, at line 490 of relation.h, reads 8 bytes from a memory location that is not allocated (here, one that was allocated and then freed).

==7682==  Address 0x10000cee8 is 8 bytes inside a block of size 24 free'd

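The accessed address lies 8 bytes past the start of a 24-byte block, and that block has already been freed. Combined with the 8-byte read size above, this probably means you are reading an 8-byte field (a pointer, for instance) at offset 8 of a 24-byte structure, so look for such a structure.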
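To make the "offset 8 in a 24-byte block" reading concrete, here is a minimal sketch. The struct name and its fields are assumptions for illustration only (the real definition in relation.h isn't shown); the point is just that, on a typical 64-bit platform, three pointer-sized members give a 24-byte block whose second member sits at offset 8.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical layout, NOT the real structure from relation.h.
 * Any struct with three pointer-sized members would fit the report. */
struct relation_node {
    char *name;                  /* offset 0 on a typical 64-bit ABI */
    struct relation_node *next;  /* offset 8: an 8-byte read here matches "8 bytes inside" */
    struct relation_node *prev;  /* offset 16 */
};

/* Byte offset of the member an "8 bytes inside" address would point at. */
size_t offset_of_next(void)
{
    return offsetof(struct relation_node, next);
}

/* Size of the block that malloc(sizeof(struct relation_node)) requests. */
size_t size_of_node(void)
{
    return sizeof(struct relation_node);
}
```

Printing offsetof and sizeof for your own structures in the same way should tell you quickly which type and which field the report refers to.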

==7682==    at 0xD828: free (vg_replace_malloc.c:450)
==7682==    by 0x100001232: destroyrelation2 (relation.h:161)
==7682==    by 0x100003407: destroyallhashmap (computation.h:333)
==7682==    by 0x1000039E1: closure (computation.h:539)
==7682==    by 0x100003CBE: main (test-computation.c:38)

This is the call stack that freed the address your program later reads when it crashes. It starts with free, which is expected because that is the function you use to deallocate memory. The file and line shown for it (vg_replace_malloc.c:450) belong to valgrind's replacement for free, so they are not very relevant. What is relevant is that this free is called from destroyrelation2 at line 161 of relation.h, and this is the free that invalidates the pointer. destroyrelation2 is itself called by destroyallhashmap, which is called by closure, which is called by main at line 38 of test-computation.c. You need to find out which mistake in your memory management lets printrelation2 (called from main at line 47) use a pointer that was already freed during the closure call at line 38.
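As an illustration of how a function can end up handing back freed memory, here is a minimal sketch. All names and types are hypothetical, and this is only one plausible shape of the bug, not a diagnosis of your code: a function releases its working storage and returns a pointer into it. The safer variant gives the caller its own copy before the cleanup runs.

```c
#include <stdlib.h>

/* Hypothetical node type; the real one lives in relation.h and is not shown. */
struct node {
    int value;
    struct node *next;
};

/* Frees every node in the list. Afterwards no pointer into the list may be used. */
void destroy_list(struct node *head)
{
    while (head) {
        struct node *next = head->next;  /* read next BEFORE freeing head */
        free(head);
        head = next;
    }
}

/* Returns a deep copy of the list, or NULL on empty input or allocation failure. */
struct node *copy_list(const struct node *src)
{
    struct node *head = NULL, **tail = &head;
    for (; src; src = src->next) {
        struct node *n = malloc(sizeof *n);
        if (!n) {
            destroy_list(head);  /* do not leak the partial copy */
            return NULL;
        }
        n->value = src->value;
        n->next = NULL;
        *tail = n;
        tail = &n->next;
    }
    return head;
}

/* Buggy shape, shown as a comment only:
 *     destroy_list(work);
 *     return work;    // dangling: the caller reads freed memory
 */

/* Safer shape: copy the result first, then release the working storage. */
struct node *closure_fixed(struct node *work)
{
    struct node *result = copy_list(work);
    destroy_list(work);
    return result;  /* the caller owns this list and must destroy_list() it */
}
```

The design choice here is explicit ownership: whatever the function returns must not be reachable from anything its cleanup code frees. If copying is too expensive, the alternative is to detach the result from the working structure before destroying the rest.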

The memory leaks reported afterwards are real, but they are unlikely to be what causes your crash.

Is valgrind output clearer now?

Note 1: This memory leak report may change after you fix your segfault, but as it stands, here's how I interpret it:

==7682== 48 (24 direct, 24 indirect) bytes in 1 blocks are definitely lost in loss record 33 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100001104: getnewrelation (relation.h:134)
==7682==    by 0x100001848: copyrelation (relation.h:343)
==7682==    by 0x100003991: closure (computation.h:531)
==7682==    by 0x100003CBE: main (test-computation.c:38)
==7682== 
==7682== 1,128 (24 direct, 1,104 indirect) bytes in 1 blocks are definitely lost in loss record 36 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100002315: getnewholder (dependency.h:129)
==7682==    by 0x100003B17: main (test-computation.c:14)
==7682== 
==7682== LEAK SUMMARY:
==7682==    definitely lost: 48 bytes in 2 blocks
==7682==    indirectly lost: 1,128 bytes in 44 blocks
==7682==      possibly lost: 0 bytes in 0 blocks
==7682==    still reachable: 4,096 bytes in 1 blocks
==7682==         suppressed: 88 bytes in 1 blocks

Let's start with the summary:

==7682== LEAK SUMMARY:
==7682==    definitely lost: 48 bytes in 2 blocks
==7682==    indirectly lost: 1,128 bytes in 44 blocks
==7682==      possibly lost: 0 bytes in 0 blocks
==7682==    still reachable: 4,096 bytes in 1 blocks
==7682==         suppressed: 88 bytes in 1 blocks

You have two blocks of allocated memory that are no longer reachable through any pointer ("definitely lost"). That means that somewhere in the program you malloc them, and at some later point you lose every pointer to them. Those are the bad memory leaks: you need to review your logic so that you keep a handle on these blocks, or free them before the last pointer goes away. "Indirectly lost" blocks are reachable only through pointers stored in a definitely lost block, for example the nodes of a list whose head was lost; they usually go away once you free the owning structure properly, members first. "Possibly lost" means valgrind found only a pointer into the middle of a block rather than to its start; you have none here. "Still reachable" blocks are the benign kind: when the program exited you had not freed the block, but you still had a pointer to it, so you can easily add a call to free and resolve the leak.
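The second loss record (24 bytes direct, 1,104 indirect, allocated in getnewholder) has the typical shape of an owner that was dropped without freeing what it owns. A minimal sketch of a matching destroy function follows; the types and function names are hypothetical stand-ins, since the real holder from dependency.h isn't shown.

```c
#include <stdlib.h>

/* Hypothetical types; the real holder comes from getnewholder() in dependency.h. */
struct item {
    struct item *next;
};

struct holder {
    struct item *items;  /* blocks reachable only through the holder */
};

/* Frees the items first, then the holder itself.
 * Returns the number of blocks freed (0 if h is NULL). */
size_t holder_destroy(struct holder *h)
{
    size_t freed = 0;
    if (!h)
        return 0;
    struct item *it = h->items;
    while (it) {
        struct item *next = it->next;  /* save the link before freeing the node */
        free(it);
        freed++;
        it = next;
    }
    free(h);
    return freed + 1;
}

/* Allocates a holder that owns n items; returns NULL on allocation failure. */
struct holder *holder_new(size_t n)
{
    struct holder *h = malloc(sizeof *h);
    if (!h)
        return NULL;
    h->items = NULL;
    for (size_t i = 0; i < n; i++) {
        struct item *it = malloc(sizeof *it);
        if (!it) {
            holder_destroy(h);  /* release what was built so far */
            return NULL;
        }
        it->next = h->items;
        h->items = it;
    }
    return h;
}
```

If the last pointer to a holder is dropped without calling holder_destroy, valgrind would be expected to report the holder as "definitely lost" and its items as "indirectly lost". If the pointer is kept until exit but never freed, the same blocks would show up as "still reachable" instead.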

The two call stacks show you the mallocs that result in memory leaks, minus the "still reachable" leaks (to see them, you must add the options --leak-check=full --show-reachable=yes to your valgrind invocation).

Note 2: Avoid function names like destroyallhashmap (hard to read) or destroyrelation2 (numbered). Prefer destroy_all_hashmap or the less usual (in C) destroyAllHashmap, and avoid numbering your functions. Similarly, avoid variable names like nnew; use semantically meaningful names instead.