Search code examples
c#coreclr

Why most of the data structures in generic collections use array despite of Large Object Heap fragmentation?


I could see that CoreCLR and CoreFx implicitly use array for most of the generic collections. what is the main driving factor to go with arrays and how it handles any side effects of LOH fragmentation.


Solution

  • What other then arrays should collections be?

    More importnatly, what other then arrays could collections be?

    In use collection boils down to "arrays - and stuff we wrap around arrays, for ease of use.":

    • The pure thing (arrays), wich do offer some conveniences like bounds checks in C#/.NET
    • Self growing arrays (Lists)
    • Two synchronized arrays that allow the mapping of any any input to any element (Dictionaries key/value pair)
    • Three synchornized array: Key, Value and a Hashvalue to quickly identify not-matching keys (HastTable).

    Below the hood - regardless of how hard .NET makes it to use pointers - it all boils down to some code doing C/C++ style pointer arythmethic to get the next element.

    Edit 1: As I learned in another place, .NET Dictionaries are actually implemented as HashLists. The HashList class is just the pre-generics version. Object has a GetHashCode function with sensible default behavior wich can be used, but also fully overwritten.

    Fragmentation wise the "best" would be a array of references. It can be as small as the reference width (a Pointer or slightly bigger) and the GC can move around the instances to defragment memory. Of course then you get the slight overhead of accessing references rather the just counting/mathing up a pointer, so as usualy it is a memory vs speed tradeoff. However this might go into Speed Rant Territory of detail.

    Edit 2: As Markus Appel pointed out in the comments, there is something even better for fragmentation avoidance: Linked lists. Even that single array of references - if you just make it big enough - will take quite some memory in one indivisible chunk. So it might run into object size limits or array indexer limits. A linked list will do neither. But as a result the performance is around a disk that was never defragmented.

    Generics is just a convience to have typesafety in collections/other places. It avoids you having to use the dreaded Object as type, wich ruins all compile-time typesafety. Afaik they add nothing else to this situation. List<string> works the same as a StringList would.