I have an input file of about 10^5 rows.
Each row is a sequence of 24 space-separated bits, e.g.:
1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0
I need to compute the Hamming distance for every pair of rows. Here is my first implementation, using SciPy's hamming function:
import csv
from scipy.spatial.distance import hamming

with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    nodes = {}
    b = 24  # Number of bits per row
    for nodeNum, node in enumerate(reader):
        nodes[nodeNum] = [int(i) for i in node]

for u, uBits in nodes.items():
    for v, vBits in nodes.items():
        # hamming() returns the fraction of differing positions, so scale by b
        distance = hamming(uBits, vBits) * b
        # Do stuff
The second implementation I came up with packs each row into a single integer:
nodes[nodeNum] = sum(int(bit) * 2**power for power, bit in enumerate(node))
Here I only store the integer value, but then I have to count the set bits resulting from each XOR operation manually:
def hamming(a, b):
    # Count the set bits of the XOR of the two packed values
    N = a ^ b
    distance = 0
    while N:
        distance += N & 1  # add the lowest bit
        N >>= 1            # and shift it out
    return distance
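For the bit counting itself, Python's built-ins should already be faster than a hand-rolled loop; a minimal sketch, assuming the same packed integers (int.bit_count() needs Python 3.10+, while bin(...).count('1') works on older versions):

def hamming_builtin(a, b):
    # Popcount of the XOR using the built-in bit_count (Python 3.10+)
    return (a ^ b).bit_count()

def hamming_bin(a, b):
    # Equivalent for older Python versions
    return bin(a ^ b).count('1')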
How can I improve my code (both in terms of memory usage and runtime, ideally)?
This may be the fastest you can do (watch out: for your data size, it would require allocating about 74.5 GiB of memory for the result matrix):
import csv
import numpy as np

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])
nodes = np.array(nodes)

# For 0/1 entries, 2 * (u - 0.5) . (0.5 - v) + b/2 is exactly the number of differing bits
dists = 2 * np.inner(nodes - 0.5, 0.5 - nodes) + nodes.shape[1] / 2
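If the full n x n float64 matrix does not fit in memory, the same identity can be applied block-wise so that only a slice of the result is materialized at a time. A minimal sketch (the block size of 1000 rows is an arbitrary choice, not tuned):

def hamming_blocks(nodes, block=1000):
    # Yield (row_offset, distances for those rows) one block at a time,
    # using the same inner-product identity as above.
    nodes = np.asarray(nodes, dtype=np.float64)
    b = nodes.shape[1]
    for start in range(0, len(nodes), block):
        chunk = nodes[start:start + block]
        yield start, 2 * np.inner(chunk - 0.5, 0.5 - nodes) + b / 2

for start, dists in hamming_blocks(nodes):
    pass  # Do stuff with the distances for rows start .. start + len(dists) - 1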
Just for fun, here is a 40X faster version in Julia:
using LoopVectorization, Tullio
function hamming!(nodes, dists)
    @tullio dists[i,j] = sum(nodes[i,k] ⊻ nodes[j,k])
end
n = 10^5
nodes = rand(Int8[0,1],n,24)
dists = Matrix{Int8}(undef,n,n)
@time hamming!(nodes,dists) # Run twice
# 1.886367 seconds (114 allocations: 6.594 KiB)
While we're at it, I invite you to enter the world of Julia: it offers C++-like speed with a pleasant, Python-like syntax.