Search code examples
javalistsetbitset

Use cases of the BitSet over List, Set


What are the use cases when the BitSet should be preferable over the Set and List? Besides, does it have any kind of performance implication?

At least, I would like to know the best use cases for the BitSet and when I need to use them in the production code irrespective of the other data-structure mentioned in the question.


Solution

  • General use case: do a search if value V is present in dataset DS.

    Special use case: there is an organization with dataset DS of employees' emails stored in the database. You want to implement a solution to verify when a new employee account is creating there is no such email V already present in the database DS. Ideally, you want to avoid additional calls to DB. The total number of employees in the organization is 2 200 000 (Walmart had it in 2019 based on statistical data available).

    Theory: to avoid additional calls to DB you can put all existing emails from DB into some collection and search if email already present in the DB against this collection. If we speak about List and Set, they both store at least data elements by themselves with possible additional overhead based on backed data structure, for example, LinkedList will need memory overhead for pointers. There is a real chance we don't have enough memory to put all these data in our collection.

    Now, time to talk about the Bloom filter. The Bloom Filter is a fast and memory-efficient probabilistic data structure. Probabilistic data structures provide memory-efficient solutions with a tradeoff of providing a ‘probable’ result instead of a ‘guaranteed’ one (discussed above). A Bloom filter with 1% error and an optimal number of hash functions (extremely fast hash functions) requires only about 9.6 bits per element, regardless of the size of the elements. A Bloom filter doesn’t store the data elements by itself.

    There are 2 main components in the Bloom filter:

    • bite array (in Java can be represented using java.util.BitSet class)
    • hash function (one or more)

    There are 2 operations in the Bloom filter:

    • insert
      1. Value V is hashed by K hash functions generating K different indices.
      2. The values against these indices are marked as 1 (true).
        //see more details about actual implementation by the link below
        BloomFilter bf = new BloomFilter(0.1, 10); 
        //BitSet {0,0,0,0,0,0,0,0,0,0}
        bf.add("tim.brand@organization.com");
        //BitSet {0,1,0,1,0,0,0,1,0,0}
        bf.add("angel.born1@organization.com");
        //BitSet {0,1,0,1,1,0,0,1,0,1}
        bf.add("casey.clous@organization.com");
        //BitSet {1,1,0,1,0,0,1,1,0,0}
        
    • search
      1. Value V is hashed by K hash functions generating K different indices.
      2. The values against these indices are checked if all of them are 1 then there is a strong possibility of having this data in the dataset DS (needed call to DB). If any of them is zero then we definitely don’t have this value present in the dataset DS (no call to DB is needed).

    Findings: So, in this particular case BloomFilter based on BitSet seems more efficient than Set or List.

    Links:

    Note: there are requirements for has functions used and Bloom filter parameters tuning but they are out of the scope of this question.