python ip ip-address word2vec hierarchical-clustering

Clustering IP addresses by domain name

I have an IP network, which is basically a list of sequential IP addresses. I want to cluster ranges of these addresses into independent entities. To do that, I want to give each IP address in the range a set of properties: time to live (TTL), nameservers, and the domain names associated with it.

I then want to compute the distance between each IP address and its neighbors and start clustering based on the shortest distance.

My question is about the distance function. TTL is a number, so that should not be a problem. Nameservers and domain names are strings, however. How would you represent those as numbers in a vector?

Basically, if two IP addresses have the same nameserver or very similar domain names (for example, the same second-level domain, 2LD), they should have a smaller distance. I've looked into something like word2vec but can't find a useful implementation.

Solution

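I would try difflib from the Python standard library. SequenceMatcher.ratio() returns a similarity between 0.0 (nothing in common) and 1.0 (identical):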

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

Then you can call the function on each pair of names to get a similarity score and group the addresses based on that. Since clustering needs a distance rather than a similarity, you can use 1 - similarity.

similarity("server1","server1")
1.0

similarity("Server1","Server2")
0.8571428571428571

similarity("foo","bar")
0.0
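
To tie this back to the question, here is a rough sketch of how the string similarity could be folded into a single distance per pair of IPs and used to group sequential neighbors. Everything beyond similarity() is an assumption for illustration: the record layout (dicts with "ip", "ttl", "nameservers", "domains"), the helper names name_distance, record_distance and group_sequential, the equal weighting of the three properties, the max_ttl of 86400 seconds used for scaling, and the 0.3 threshold. The sample addresses and names are made up. Treat it as a starting point to tune, not a finished method.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Same helper as above: 1.0 for identical strings, 0.0 for nothing in common.
    return SequenceMatcher(None, a, b).ratio()


def name_distance(names_a, names_b):
    # Distance between two collections of names: 1 minus the best pairwise
    # similarity. An empty collection counts as maximally distant (assumption).
    if not names_a or not names_b:
        return 1.0
    best = max(similarity(a, b) for a in names_a for b in names_b)
    return 1.0 - best


def record_distance(rec_a, rec_b, max_ttl=86400):
    # TTL difference scaled into [0, 1]; max_ttl is an assumed upper bound.
    ttl_d = min(abs(rec_a["ttl"] - rec_b["ttl"]) / max_ttl, 1.0)
    ns_d = name_distance(rec_a["nameservers"], rec_b["nameservers"])
    dom_d = name_distance(rec_a["domains"], rec_b["domains"])
    # Equal weights are an arbitrary choice; adjust to taste.
    return (ttl_d + ns_d + dom_d) / 3


def group_sequential(records, threshold=0.3):
    # Walk the IPs in order and start a new group whenever the distance
    # to the previous IP exceeds the threshold.
    groups = []
    for rec in records:
        if groups and record_distance(groups[-1][-1], rec) <= threshold:
            groups[-1].append(rec)
        else:
            groups.append([rec])
    return groups


# Made-up sample data (documentation address range and example domains).
records = [
    {"ip": "192.0.2.1", "ttl": 3600,
     "nameservers": ["ns1.example.com"], "domains": ["www.example.com"]},
    {"ip": "192.0.2.2", "ttl": 3600,
     "nameservers": ["ns1.example.com"], "domains": ["mail.example.com"]},
    {"ip": "192.0.2.3", "ttl": 300,
     "nameservers": ["dns.other.net"], "domains": ["shop.other.net"]},
]

groups = group_sequential(records)
print([[rec["ip"] for rec in group] for group in groups])
```

With this sample data the first two addresses share a nameserver and a 2LD, so they land in one group, and the third starts a new one. If you want a proper hierarchical clustering instead of a single pass over neighbors, the same record_distance could be used to fill a pairwise distance matrix for a clustering library.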