Search code examples
pythonipip-addressword2vechierarchical-clustering

Clustering ip-addresses on domain names


I have an ip-network which is basically a list of sequential ip-addresses. From this list I want to cluster ranges of ip-addresses into independent entities. I want to give each IP in the range a set of properties like time to live, nameservers and domain names associated with it.

I then want to determine the distance between each IP-address and its neighbors and start clustering based on shortest distance.

My question lies in the distance function. TTL is a number so that should not be a problem. Nameservers and domain names are strings however, how would you represent those as numbers in a vector?

Basically if 2 IP-addresses have the same nameserver or very similar domain names (equal 2LD) you want them to have a smaller distance. I've looked into something like word2vec but can't really find a useful implementation.


Solution

  • I would try using difflib like this.

    from difflib import SequenceMatcher
    
    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()
    

    Then you can call the function against each set of names to get a similarity score and group them based on that.

    similarity("server1","server1")
    1.0
    
    similarity("Server1","Server2")
    0.8571428571428571
    
    similarity("foo","bar")
    0.0