Tags: python, aerospike, hyperloglog

Get Aerospike HyperLogLog (HLL) intersection count of multiple HLL unions


I have two or more groups of HLLs, each of which I union, and I want the intersection count of those unions. I based my code on the hll-python example. My code is below:

ops = [hll_ops.hll_get_union(HLL_BIN, records)]
_, _, result1 = client.operate(getKey(value), ops)

ops = [hll_ops.hll_get_union(HLL_BIN, records2)]
_, _, result2 = client.operate(getKey(value2), ops)

ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [result1[HLL_BIN]] + [result2[HLL_BIN]])]
_, _, resultVal = client.operate(getKey(value), ops)
print(f'intersectAll={resultVal}')
_, _, resultVal2 = client.operate(getKey(value2), ops)
print(f'intersectAll={resultVal2}')

I get two different results depending on which key I pass to the hll_get_intersect_count call, i.e. resultVal and resultVal2 are not the same. This does not happen for the union count with hll_get_union_count. Ideally, the intersection count should be the same regardless of the key.
Can anyone tell me why this is happening and what the right way to do it is?


Solution

  • I figured out the solution with the help of Aerospike support (the same question was posted and discussed in more detail on the Aerospike forum).
    I am posting my code for others who run into the same issue.

    Aerospike does not support intersecting a list of HLLs on its own: hll_get_intersect_count always intersects the supplied HLLs with the HLL stored in the bin of the record whose key is passed to client.operate, which is why different keys gave different counts. To get the intersection of multiple unions, I have to save one union into Aerospike and then get the intersection count of that stored union against the remaining unions.
    This is the code I came up with:

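    from aerospike_helpers.operations import hll_operations as hll_ops

    # Union of the first group of HLLs, returned to the client (not stored)
    ops = [hll_ops.hll_get_union(HLL_BIN, records)]
    _, _, result1 = client.operate(getKey(value), ops)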
    
    # Initialize an empty HLL bin on a temporary record
    ops = [hll_ops.hll_init(HLL_BIN, NUM_INDEX_BITS, NUM_MH_BITS)]
    _, _, _ = client.operate(getKey('dummy'), ops)
    # Use hll_set_union to write the union of the second group into the
    # temporary record, and keep that record for 5 minutes (TTL of 300 seconds)
    ops = [hll_ops.hll_set_union(HLL_BIN, records2)]
    _, _, _ = client.operate(getKey('dummy'), ops, meta={"ttl": 300})
    
    ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [result1[HLL_BIN]])]
    _, _, resultVal = client.operate(getKey('dummy'), ops)
    print(f'intersectAll={resultVal}')
    

    For details, see the hll_set_union reference.
    A more detailed discussion can be found on the Aerospike forum.
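
To illustrate why the key matters, here is a minimal toy model, not Aerospike code. It assumes that exact Python sets can stand in for HLL sketches and that a dict can stand in for the stored records. The helper names (`get_union`, `get_intersect_count`, `set_union`, `store`) and the sample values are hypothetical. They only mimic the behaviour described above, where the HLL stored under the key always takes part in the operation.

```python
# Toy model: sets stand in for HLLs, a dict stands in for stored records.
# All names and values here are hypothetical and for illustration only.
store = {
    "key1": {1, 2, 3},
    "key2": {5, 6, 7},
}

def get_union(key, hlls):
    """Mimics hll_get_union: stored bin united with the supplied HLLs."""
    return store[key].union(*hlls)

def get_intersect_count(key, hlls):
    """Mimics hll_get_intersect_count: the stored bin is always included."""
    return len(store[key].intersection(*hlls))

def get_union_count(key, hlls):
    """Mimics hll_get_union_count: the stored bin is always included."""
    return len(store[key].union(*hlls))

def set_union(key, hlls):
    """Mimics hll_set_union: writes the union back into the stored bin."""
    store[key] = store.get(key, set()).union(*hlls)

records = [{3, 4, 5}, {5, 6}]
records2 = [{1, 8}, {9}]

result1 = get_union("key1", records)    # {1, 2, 3, 4, 5, 6}
result2 = get_union("key2", records2)   # {1, 5, 6, 7, 8, 9}

# The approach from the question: the answer depends on the key used.
from_key1 = get_intersect_count("key1", [result1, result2])
from_key2 = get_intersect_count("key2", [result1, result2])

# The union count does not depend on the key, because each stored bin
# is already contained in its own union.
union_key1 = get_union_count("key1", [result1, result2])
union_key2 = get_union_count("key2", [result1, result2])

# The workaround: store one union under a temporary key, then intersect
# the other union against that stored value.
store["dummy"] = set()                  # plays the role of hll_init
set_union("dummy", [result2])
via_temp = get_intersect_count("dummy", [result1])

print(from_key1, from_key2, via_temp)   # 1 2 3
print(union_key1, union_key2)           # 9 9
```

In this model the two key-dependent counts differ because each one is silently narrowed by the bin stored under its key, while the temporary record holds exactly one of the unions, so the count reflects the intersection of the two unions only.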