python algorithm binding substring binary-data

Searching for sequence of bits in an integer in Python

I have two integers, lets call them haystack and needle. I need to check that, if the binary representation of needle occurred in haystack [and OPTIONALLY find the position of the first occurrence]

Example

haystack = 0b10101111010110010101010101110
needle = 0b1011001 # occurred in position 13
needle = 0b111011 # not occurred

I am looking for the least possible time complexity, I cannot write a code with time complexity better than O(h), where h is the number of bits in the haystack. You can see my code below.

I need to check occurrence of a predefined needle (which never changes and it's an odd number), in billions of random haystack integers (So we cannot preprocess haystack to optimize speed)

As finding position is optional, if you can write a code with better time complexity to just return a Boolean that indicate occurrence, it's perfect too. Because in billions of check I know that it's not occurred and when it occurred I can use the following code to find the position.

A good probabilistic algorithm with False Positive results is fine too.

def find_needle_in_haystack(haystack, needle):
    n = needle.bit_length()  # calculate the number of bits in needle
    mask = (1 << n) - 1  # create a mask with n bits
    i = 0
    while haystack != 0:
        x = haystack & mask  # bitwise AND with mask to get first n bits
        if x == needle:
            return i
        i += 1
        haystack >>= 1  # shift haystack to the right by 1 bit
    return -1  # needle not found in haystack

Solution

First, Python is a poor choice of language for this. One way or another you will do a lot of bit twiddling. Python is a high level language with slow abstraction layers for programmer convenience. That adds a lot of overhead onto any simple looking bit twiddle.

That said, I would recommend using a pre-computed lookup table for this. The idea is as follows. We need an array of, by number of matched bits, by next byte, the number of currently matched bits. This can be stored in an array of length bits * 256. With the value at position 256*a + b being a number from telling you that if you'd previously matched a bits and b is the next byte, how many bits you've NOW matched.

And now your logic looks something like this (ignoring casting):

matched = 0
for b in bytes:
    matched = lookup[256*matched + int(b)]
    if matched == length_of_needle:
        return True
return False

Here is some sample code demonstrating the idea. Note that I 0-padded the end of the bits to wind up with an even number of bytes.

# Assuming that needle is an array of bits.
def needle_to_lookup (needle):
    ord2bits = []
    for j in range(256):
        bits = []
        k = j
        for _ in range(8):
            bits.append(k % 2)
            k = k // 2
        ord2bits.append(tuple(reversed(bits)))

    lookup = []
    for i in range(len(needle) + 1):
        for bits in ord2bits:
            # Do we successfully keep matching?
            matched = i
            for j in range(8):
                if i + j == len(needle):
                    matched = i+j
                    break
                elif needle[i+j] == bits[j]:
                    matched = i+j
                else:
                    matched = 0
                    break
            if 0 == matched: # Failed to extend for a byte.
                for j in range(8):
                    for k in range(8 - j):
                        if k == len(needle):
                            matched = k
                            break
                        elif needle[k] == bits[j+k]:
                            matched = k+1
                        else:
                            matched = 0
                            break
                    if 0 < matched:
                        break
            lookup.append(matched)
    return lookup

def find_needle(needle, byte_list):
    lookup = needle_to_lookup(needle)
    matched = 0
    for byte in byte_list:
        matched = lookup[256*matched + byte]
        if matched == len(needle):
            return True
    return False
            

print(find_needle([1, 0, 1, 1, 0, 0, 1], bytes([175, 89, 85, 184])))
print(find_needle([1, 1, 1, 0, 1, 1], bytes([175, 89, 85, 184])))