Search code examples
c#performancesearchcollectionsrandom-access

Fast random access to a collection


I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).

The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.

I started with a list and identified the appropriate token object to update using a Linq query.

public class Model {
    public List<State> States { get; set; }
    ...
}

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.

Performance analysis shows that 85% of the total Cpu cycles are being spent on the Where(...).SingleOrDefault() (which makes sense, lists are inefficient way to search).

So, I switched the list over to a HashSet and profiled again, confident that HashSet would be able to random seek faster. This time, I was only processing ~900 tokens/sec. And a near-identical amount of time was spent on the Linq (89%).

So... First up, am I misusing the HashSet? (Is using Linq is forcing a conversion to IEnumerable and then an enumeration / something similar?)

If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a Binary seek so I assume I'd need to build some sort of tree structure and have smaller sub-sets?

To answer some questions form comments... The condition is unique (if I get the same token twice, I want to update the same entry), the HashSet is the stock .Net implementation (System.Collections.Generic.HashSet<T>).

A wider view of the code is...

        var state = new RollingList(model.StateDepth); // Tracks last n items and drops older ones. (Basically an array and an index that wraps around
        var tokens = tokeniser.Tokenise(contents); // Iterator
        foreach (var token in tokens) {
            var stateText = StateToString(ref state);
            var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
            // ... update the match as appropriate for the token
        }

Solution

  • var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
    

    If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.

    Possible data structures that meet your needs:

    • Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.

    • Make a sorted dictionary mapping from text to state. Again, search on text. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting.