Tags: python, memory, scikit-learn, cluster-analysis, hierarchical-clustering

How to specify the memory directory for Agglomerative clustering using sklearn


I am trying to reduce the computation time needed to produce several results with different numbers of clusters on the same data set using sklearn's AgglomerativeClustering.

As indicated in "sklearn agglomerative clustering: dynamically updating the number of clusters", it is possible to store the entire tree computed by AgglomerativeClustering. You can then change the n_clusters parameter of the clustering object and simply extract the clustering of the same data set into the new number of clusters.

I am sorry if this is a trivial question, but I have very little experience with dealing with memory using Python. My question is how to specify the cache directory used by AgglomerativeClustering. In the example in the link above, it is written as:

AgglomerativeClustering(n_clusters=10, memory='mycachedir', compute_full_tree=True)

What is 'mycachedir' exactly? Do I need to replace it with my own cache directory, or does Python create a new directory somewhere called 'mycachedir'? If so, is it removed when my program ends? I would like the cache to be removed once my program stops or ends. Again, I am sorry if this is obvious.

I tried to run it with the string "mycachedir" and Python does not raise an error. So where is this directory located? And how does it behave? E.g., is it removed once the program ends?


Solution

  • According to the scikit-learn documentation, "if a string is given, it is the path to the caching directory."

    Under the hood, caching is performed by the joblib.Memory class of the joblib package. The directory is created by os.makedirs(os.path.expanduser(memory)), where memory is the argument passed to AgglomerativeClustering. Although the cache can be deleted with joblib.Memory.clear, to the best of my knowledge AgglomerativeClustering.fit never does so.

    Using the example from the sklearn.cluster.AgglomerativeClustering documentation:

    import os
    
    # EXTERNALS
    from sklearn.cluster import AgglomerativeClustering
    import numpy as np
    
    X = np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]])
    
    # Expand "~" explicitly so the location is unambiguous; a plain
    # relative path would instead be resolved against the working
    # directory (cf. `os.getcwd()`)
    memory_dir = os.path.expanduser("~/tmp/my_cached_memory_folder")
    
    clustering = AgglomerativeClustering(memory=memory_dir).fit(X)
    
    full_path = os.path.abspath(memory_dir)
    
    print(f"Cached memory directory: {full_path}")
    print(os.path.isdir(full_path))
    
    # Cached memory directory: /home/remi_cuingnet/tmp/my_cached_memory_folder
    # True
    

    Note that the cache is not removed when the program ends; you have to clear it manually.
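
If the cache should disappear automatically, one option is to place it in a temporary directory. The following is only a minimal sketch, not something prescribed by scikit-learn: it assumes the standard-library `tempfile.TemporaryDirectory` is acceptable as the cache location, and the variable names (`cache_dir`, `labels_2`, `labels_3`) are made up for illustration. The directory is deleted as soon as the `with` block exits, so all refits that should reuse the cached tree have to happen inside it.

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# The temporary directory (and the joblib cache inside it) is removed
# automatically when the `with` block exits.
with tempfile.TemporaryDirectory() as cache_dir:
    model = AgglomerativeClustering(
        n_clusters=2, memory=cache_dir, compute_full_tree=True
    )
    labels_2 = model.fit_predict(X)

    # Change the number of clusters and refit on the same data;
    # the cached tree can be reused instead of being recomputed.
    model.set_params(n_clusters=3)
    labels_3 = model.fit_predict(X)

    print("Cache exists inside the block:", os.path.isdir(cache_dir))

print("Cache exists after the block:", os.path.exists(cache_dir))
```

The same idea works with a fixed path if you prefer to delete the directory yourself at the end of the program, for example with `shutil.rmtree`.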