I have two integers, lets call them haystack
and needle
. I need to check that, if the binary representation of needle
occurred in haystack
[and OPTIONALLY find the position of the first occurrence]
haystack = 0b10101111010110010101010101110
needle = 0b1011001 # occurred in position 13
needle = 0b111011 # not occurred
I am looking for the least possible time complexity, I cannot write a code with time complexity better than O(h)
, where h
is the number of bits in the haystack. You can see my code below.
I need to check occurrence of a predefined needle
(which never changes and it's an odd number), in billions of random haystack
integers (So we cannot preprocess haystack
to optimize speed)
As finding position is optional, if you can write a code with better time complexity to just return a Boolean that indicate occurrence, it's perfect too. Because in billions of check I know that it's not occurred and when it occurred I can use the following code to find the position.
A good probabilistic algorithm with False Positive results is fine too.
def find_needle_in_haystack(haystack, needle):
n = needle.bit_length() # calculate the number of bits in needle
mask = (1 << n) - 1 # create a mask with n bits
i = 0
while haystack != 0:
x = haystack & mask # bitwise AND with mask to get first n bits
if x == needle:
return i
i += 1
haystack >>= 1 # shift haystack to the right by 1 bit
return -1 # needle not found in haystack
First, Python is a poor choice of language for this. One way or another you will do a lot of bit twiddling. Python is a high level language with slow abstraction layers for programmer convenience. That adds a lot of overhead onto any simple looking bit twiddle.
That said, I would recommend using a pre-computed lookup table for this. The idea is as follows. We need an array of, by number of matched bits, by next byte, the number of currently matched bits. This can be stored in an array of length bits * 256
. With the value at position 256*a + b
being a number from telling you that if you'd previously matched a
bits and b
is the next byte, how many bits you've NOW matched.
And now your logic looks something like this (ignoring casting):
matched = 0
for b in bytes:
matched = lookup[256*matched + int(b)]
if matched == length_of_needle:
return True
return False
Here is some sample code demonstrating the idea. Note that I 0-padded the end of the bits to wind up with an even number of bytes.
# Assuming that needle is an array of bits.
def needle_to_lookup (needle):
ord2bits = []
for j in range(256):
bits = []
k = j
for _ in range(8):
bits.append(k % 2)
k = k // 2
ord2bits.append(tuple(reversed(bits)))
lookup = []
for i in range(len(needle) + 1):
for bits in ord2bits:
# Do we successfully keep matching?
matched = i
for j in range(8):
if i + j == len(needle):
matched = i+j
break
elif needle[i+j] == bits[j]:
matched = i+j
else:
matched = 0
break
if 0 == matched: # Failed to extend for a byte.
for j in range(8):
for k in range(8 - j):
if k == len(needle):
matched = k
break
elif needle[k] == bits[j+k]:
matched = k+1
else:
matched = 0
break
if 0 < matched:
break
lookup.append(matched)
return lookup
def find_needle(needle, byte_list):
lookup = needle_to_lookup(needle)
matched = 0
for byte in byte_list:
matched = lookup[256*matched + byte]
if matched == len(needle):
return True
return False
print(find_needle([1, 0, 1, 1, 0, 0, 1], bytes([175, 89, 85, 184])))
print(find_needle([1, 1, 1, 0, 1, 1], bytes([175, 89, 85, 184])))