
Clustering sentence vectors in a dictionary


I'm working with a kind of unique situation. I have words in Language1 that I've defined in English. I then took each English word, took its word vector from a pretrained GoogleNews w2v model, and averaged the vectors for every definition. The result (an example with 3-dimensional vectors):

L1_words={
'word1': array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=float32),
'word2': array([ 5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=float32)
}

What I want to do is cluster the keys of the dict (using K-means probably, but I'm open to other ideas) by their numpy-array values. I've done this before with standard w2v models, but the issue I'm having is that this is a dictionary. Is there another data structure I can convert this to? I'm inclined to write it to a csv / make it into a pandas dataframe and use Pandas or R to work on it like that, but I'm told that floats are a problem when it comes to things requiring binary (as in: they lose information in unpredictable ways). I tried saving my dictionary to hdf5, but dictionaries are not supported.
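(A side note on the serialization worry: binary formats such as NumPy's `.npz` store float32 arrays bit-for-bit, so nothing is lost; it's text formats like CSV that can round floats. A minimal sketch, where the dict values and the filename `l1_words.npz` are just placeholders:)

```python
import numpy as np

# placeholder dict mirroring the structure above
L1_words = {
    'word1': np.array([0.0512, -0.0223, -0.1728], dtype=np.float32),
    'word2': np.array([0.0509, -0.2678, -0.0015], dtype=np.float32),
}

# each key becomes a named array inside the .npz archive, stored in binary
np.savez('l1_words.npz', **L1_words)

# reload and verify the round trip is exact
loaded = np.load('l1_words.npz')
assert all(np.array_equal(L1_words[k], loaded[k]) for k in L1_words)
```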

Thanks in advance!


Solution

  • If I understand your question correctly, you want to cluster words according to their W2V representation, but you are storing them in a dictionary. If that's the case, I don't think it is a unique situation at all. All you have to do is convert the dictionary into a matrix and then perform the clustering on the matrix. If each row of the matrix corresponds to one word in your dictionary, you can map the rows back to the words after clustering.

    I couldn't test the code below, so it may not be completely functional, but the idea is the following:

    from nltk.cluster import KMeansClusterer
    import nltk
    import numpy as np
    
    # build the matrix: one row per word, keeping a fixed word order
    # (in Python 3, .keys() returns a view, so make it a list to index it later)
    words = list(L1_words.keys())
    X = np.array([L1_words[w] for w in words])
    
    # cluster the rows of the matrix using cosine distance
    NUM_CLUSTERS = 3  # must not exceed the number of words
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
    assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
    
    # print the cluster each word belongs to
    for word, cluster in zip(words, assigned_clusters):
        print(word, cluster)
    

    You can read more details in this link.
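    Since you mentioned being open to other ideas: if you'd rather use scikit-learn, note that its plain `KMeans` uses Euclidean distance, but L2-normalizing the vectors first makes it behave much like cosine-based clustering. A minimal sketch under that assumption (the dict below is a stand-in for your real `L1_words`):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import normalize

    # stand-in for L1_words; replace with your real dict
    L1_words = {
        'word1': np.array([0.05, -0.02, -0.17], dtype=np.float32),
        'word2': np.array([0.05, -0.27, -0.01], dtype=np.float32),
        'word3': np.array([0.30,  0.10,  0.05], dtype=np.float32),
        'word4': np.array([0.29,  0.11,  0.04], dtype=np.float32),
    }

    words = list(L1_words)
    # stack rows into a matrix and scale each row to unit length,
    # so Euclidean k-means approximates cosine-based clustering
    X = normalize(np.stack([L1_words[w] for w in words]))

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    labels = dict(zip(words, km.labels_))
    print(labels)
    ```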