
Clustering IP addresses by domain name

I have an IP network, which is basically a list of sequential IP addresses. From this list I want to cluster ranges of IP addresses into independent entities. I want to give each IP address in a range a set of properties, such as time to live (TTL), nameservers, and the domain names associated with it.

I then want to determine the distance between each IP address and its neighbors, and start clustering based on the shortest distance.

My question is about the distance function. TTL is a number, so that should not be a problem. Nameservers and domain names are strings, however. How would you represent those as numbers in a vector?

Basically, if two IP addresses have the same nameserver or very similar domain names (for example, the same second-level domain, 2LD), they should have a smaller distance. I've looked into something like word2vec but couldn't find a useful implementation.
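
To make the idea concrete, here is a minimal sketch of the kind of distance I have in mind. It does not turn the strings into vector components at all. Instead it computes one partial distance per property and averages them. Everything here is an assumption for illustration: the record layout (`ttl`, `nameservers`, `domains`), the helper names, the equal weights, the `max_ttl` cap, and the naive "last two labels" 2LD extraction, which gets suffixes such as `co.uk` wrong.

```python
def second_level_domain(name):
    # Naive 2LD: keep the last two labels ("www.example.com" -> "example.com").
    # Assumption: multi-label public suffixes such as "co.uk" are not handled.
    labels = name.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def jaccard_distance(a, b):
    # 0.0 when both sets are equal, 1.0 when they share nothing.
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def record_distance(r1, r2, max_ttl=86400, weights=(1.0, 1.0, 1.0)):
    # TTL: absolute difference, scaled into [0, 1] by a hypothetical max_ttl.
    ttl_d = min(abs(r1["ttl"] - r2["ttl"]) / max_ttl, 1.0)
    # Nameservers: set overlap, so a shared nameserver lowers the distance.
    ns_d = jaccard_distance(r1["nameservers"], r2["nameservers"])
    # Domain names: compare the sets of 2LDs, so an equal 2LD lowers the distance.
    dom_d = jaccard_distance(
        {second_level_domain(d) for d in r1["domains"]},
        {second_level_domain(d) for d in r2["domains"]},
    )
    w_ttl, w_ns, w_dom = weights
    total = w_ttl * ttl_d + w_ns * ns_d + w_dom * dom_d
    return total / (w_ttl + w_ns + w_dom)


# Made-up records for two neighboring addresses and an unrelated one.
a = {"ttl": 300, "nameservers": ["ns1.example.net"], "domains": ["www.example.com"]}
b = {"ttl": 300, "nameservers": ["ns1.example.net"], "domains": ["mail.example.com"]}
c = {"ttl": 3600, "nameservers": ["ns.other.org"], "domains": ["shop.other.org"]}

print(record_distance(a, b))  # same TTL, nameserver and 2LD
print(record_distance(a, c))  # nothing in common except a similar TTL
```

Because the result is a pairwise distance rather than a vector, it would only work with clustering methods that accept a custom or precomputed distance.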


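  • I would try using difflib from the standard library, like this:

    from difflib import SequenceMatcher

    def similarity(a, b):
        # ratio() returns a float in [0, 1]; 1.0 means the strings are identical
        return SequenceMatcher(None, a, b).ratio()

    Then you can call the function on each pair of names to get a similarity score (use 1 - similarity as the distance) and group the addresses based on that.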
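
As a rough sketch of the grouping step, assuming the names are listed in the same order as the sequential IP addresses, one option is to start a new group whenever a name is not similar enough to its predecessor. The function name `group_consecutive` and the threshold of 0.6 are arbitrary choices for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Character-level similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()


def group_consecutive(names, threshold=0.6):
    # Walk the ordered names and extend the current group while each name
    # is at least `threshold` similar to the previous one.
    groups = []
    for name in names:
        if groups and similarity(groups[-1][-1], name) >= threshold:
            groups[-1].append(name)
        else:
            groups.append([name])
    return groups


# Made-up names, one per sequential IP address.
names = ["www.example.com", "mail.example.com", "shop.other.org"]
print(group_consecutive(names))
```

Note that `SequenceMatcher` compares raw characters, so unrelated names that share a long suffix or prefix can still score fairly high. Comparing only the part of the name you care about, such as the 2LD, may give cleaner groups.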
