I have an IP network, which is basically a list of sequential IP addresses. From this list I want to cluster ranges of IP addresses into independent entities. I want to give each IP in the range a set of properties, such as time to live (TTL), nameservers, and the domain names associated with it.
I then want to determine the distance between each IP address and its neighbors, and start clustering based on the shortest distance.
My question is about the distance function. TTL is a number, so that should not be a problem. Nameservers and domain names are strings, however. How would you represent those as numbers in a vector?
Basically, if two IP addresses have the same nameserver or very similar domain names (the same second-level domain, 2LD), they should have a smaller distance. I've looked into something like word2vec but can't find a useful implementation.
I would try using SequenceMatcher from difflib, like this:
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio of matching characters, between 0.0 (nothing in common) and 1.0 (identical)
    return SequenceMatcher(None, a, b).ratio()
Then you can call the function on each pair of names to get a similarity score between 0.0 and 1.0 and group them based on that. If you need a distance rather than a similarity, use 1 - similarity(a, b).
>>> similarity("server1", "server1")
1.0
>>> similarity("Server1", "Server2")
0.8571428571428571
>>> similarity("foo", "bar")
0.0
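To tie this back to the distance function in the question, here is a rough sketch of how the string score could be combined with TTL and nameservers into one distance, and how neighboring IPs could then be grouped. Everything beyond similarity() is my own assumption: the dict keys (ttl, nameservers, domain), the helper names, the equal weights, max_ttl, and the 0.3 threshold are all placeholders to tune. The 2LD extraction is deliberately naive (it just takes the last two labels), so it gets suffixes like co.uk wrong.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


def second_level_domain(name):
    # Naive assumption: the 2LD is the last two labels ("www.example.com" -> "example.com").
    labels = name.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def jaccard_distance(a, b):
    # 0.0 when both sets are identical, 1.0 when they share nothing.
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def domain_distance(a, b):
    # Equal 2LD counts as distance 0; otherwise fall back to string similarity.
    sld_a, sld_b = second_level_domain(a), second_level_domain(b)
    if sld_a == sld_b:
        return 0.0
    return 1.0 - similarity(sld_a, sld_b)


def distance(x, y, max_ttl=86400, weights=(1.0, 1.0, 1.0)):
    # Each component is scaled to [0, 1] so no single property dominates.
    ttl_d = min(abs(x["ttl"] - y["ttl"]) / max_ttl, 1.0)
    ns_d = jaccard_distance(x["nameservers"], y["nameservers"])
    dom_d = domain_distance(x["domain"], y["domain"])
    w_ttl, w_ns, w_dom = weights
    return (w_ttl * ttl_d + w_ns * ns_d + w_dom * dom_d) / (w_ttl + w_ns + w_dom)


def cluster_sequential(hosts, threshold=0.3):
    # Walk the IPs in order; start a new cluster when the distance
    # to the previous IP exceeds the (hypothetical) threshold.
    clusters = []
    for host in hosts:
        if clusters and distance(clusters[-1][-1], host) <= threshold:
            clusters[-1].append(host)
        else:
            clusters.append([host])
    return clusters
```

Because the distance is a plain function of two records, you could also compute it for all pairs and feed the resulting matrix to a standard clustering routine instead of the simple neighbor walk above. That avoids having to embed the strings as vector coordinates at all.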