Java G1GC - Card Table (CT) vs Remembered Set (RS)

Why g1 needs both of these data structures ?

My understanding is:

CT holds the information about references' actual location in old generation.
RS is specific to each region, each region has one RS associated with it, it stores info about outside references which points to objects in this region. So while navigating RS, actual location of reference can be found using CT.

Why can't RS hold all of the information instead of using CT ?

Is it because otherwise there would be too much of duplicate data stored in different RSs ? For example - Young generation has regions A and B, objects in both these regions have same outside reference from old generation. In this case both RSs associated with region A and B will store this information which is redundant, is this explanation correct ?

Solution

Let's first put some things in correct order. A Card Table shows were there might be incoming references into this region. It does sort of a "hash". For a single byte in the card table - there are many bytes in the old region. It's like saying that if (in theory) you have a card table that looks like this (a single card is marked as "dirty")

0 1 0 0 0 0

For that 1 (dirty card) there is a certain mapping in the old region that needs to be scanned also, when scanning the young regions.

      0          1      0     .....

    0 - 512   512-1024  ...........

So that dirty card corresponds to a certain portion (from 512 to 1024 bytes) in the old generation, that will be scanned also, as part of the young generation scanning.

G1 has regions and now you need to understand how CT and RS work in tandem. Let's say GC scans Region1 at this point in time, it takes everything that is alive from this region and copies into Region2. At the same time Region2 has references to Region3. If we use CT, Region2 will be marked as "dirty" (by placing the specific card in the card table).

Next cycle wants to scan Region3. How does Region3 knows if there are other regions that might point into it? And indeed there are such cases: Region2 has references into Region3. Well, it could look into the CT and check every single dirty card (and thus every region that these dirty cards corresponds to) and see if there are references from any of those regions in here. Think about it: in order to clear Region3, G1 has to look at the entire CT. In the worst case scenario, it should be scanning the entire heap - for a single region.

Thus: Remembered Sets. These data structures are populated based on what CT knows about, by an asynchronous thread. When Region2 is marked as dirty, an asynchronous thread will start computing its RS. When Region3 needs to be scanned, its RS will only have an entry with Region2.

Thus, in order to scan a single region, G1 only needs to look into that particular RS.