Search code examples
cgarbage-collectionprinciplesboehm-gc

How does Boehm GC work for C program?


I checked Boehm GC. The GC for C/C++.

I know mark-and-sweep algorithm. What I'm in curious is how it picks up only pointers in whole C memory. My understanding about C memory is just a plain byte array. Is it possible to determine a value in memory is pointer or not?


Solution

  • The Boehm GC is a conservative collector, which means it assumes everything (on pointer boundaries) is a pointer. This means that it can find false positive references, like an integer which coincidentally has the value of an address in the heap. As a result, some blocks may stay in memory longer than they would with a non-conservative collector.

    Here's a description from Boehm's page:

    The garbage collector uses a modified mark-sweep algorithm. Conceptually it operates roughly in four phases, which are performed occasionally as part of a memory allocation:

    1. Preparation Each object has an associated mark bit. Clear all mark bits, indicating that all objects are potentially unreachable.
    2. Mark phase Marks all objects that can be reachable via chains of pointers from variables. Often the collector has no real information about the location of pointer variables in the heap, so it views all static data areas, stacks and registers as potentially containing pointers. Any bit patterns that represent addresses inside heap objects managed by the collector are viewed as pointers. Unless the client program has made heap object layout information available to the collector, any heap objects found to be reachable from variables are again scanned similarly.
    3. Sweep phase Scans the heap for inaccessible, and hence unmarked, objects, and returns them to an appropriate free list for reuse. This is not really a separate phase; even in non incremental mode this is operation is usually performed on demand during an allocation that discovers an empty free list. Thus the sweep phase is very unlikely to touch a page that would not have been touched shortly thereafter anyway.
    4. Finalization phase Unreachable objects which had been registered for finalization are enqueued for finalization outside the collector.

    You should also know that the Boehm GC needs to be given a set of "roots", which are starting points for the mark-and-sweep algorithm. The stack and registers are automatically roots. You need to explicitly add global pointers as roots.


    EDIT: In comments, some concerns were pointed out about conservative collectors in general. It is true that integers that look like heap pointers to the collector can cause memory not to be released. This is not as big of a problem as you might think. Most scalar integers in a program are used for counts and sizes and are fairly small (so they would not look like heap pointers). You would mostly run into problems with arrays containing bitmaps, strings, floating point data, or anything of that sort. Boehm GC lets you allocate a block with GC_MALLOC_ATOMIC which indicates to the collector that the block will not contain any pointers. If you look in gc_typed.h, you will also find ways to specify what parts of a block may contain pointers.

    That said, a fundamental limitation of a conservative collector is that it cannot safely move memory around during collection since pointer rewriting is not safe. This means you won't get any of the benefits of compaction like lowered fragmentation and improved cache performance.