python machine-learning meta-learning

Meta-feature analysis: split data for computation on available memory


I am using the meta-feature extraction package pymfe for complexity analysis. On a small dataset this works without problems, for example:

pip install -U pymfe

from sklearn.datasets import make_classification
from sklearn.datasets import load_iris
from pymfe.mfe import MFE

data = load_iris()
X = data.data
y = data.target

extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
(['t1'], [0.12])

My dataset is large, with shape (32690, 80), and the computation gets killed for excessive memory usage. I am working on Ubuntu 24.04 with 32 GB of RAM.

To reproduce the scenario:

# Generate the dataset
X, y = make_classification(n_samples=20_000, n_features=80,
                           n_informative=60, n_classes=5, random_state=42)

extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
Killed

Question:

How can I split this computation so that it runs on small partitions of the dataset, and then combine the per-partition results (for example, by averaging)?


Solution

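  • I managed to find a workaround: split the dataset into partitions, compute the meta-features on each partition, and average the per-partition results.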

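    import numpy as np
    from pymfe.mfe import MFE

    # helper functions
    def split_dataset(X, y, n_splits):
        # Split rows into n_splits contiguous chunks (no shuffling), so the
        # rows should not be sorted by class, or a chunk may miss some classes
        split_X = np.array_split(X, n_splits)
        split_y = np.array_split(y, n_splits)
        return split_X, split_y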
    
    def compute_meta_features(X, y):
        # meta-features for a partition
        extractor = MFE(features=["t1"], groups=["complexity"], 
            summary=["min", "max", "mean", "sd"])
        extractor.fit(X, y)
        return extractor.extract()
    
    def average_results(results):
        # Each result is a (feature_names, values) pair; average the values
        # element-wise across partitions
        features = results[0][0]
        summary_values = np.mean([result[1] for result in results], axis=0)
        return features, summary_values
    
    # Split dataset
    n_splits = 10  # ten splits
    split_X, split_y = split_dataset(X, y, n_splits)
    
    # Meta-features for each partition
    results = [compute_meta_features(X_part, y_part)
               for X_part, y_part in zip(split_X, split_y)]
    
    # Combined results
    final_features, final_summary = average_results(results)
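
One possible refinement, offered as a sketch rather than a tested recipe: np.array_split cuts the rows into contiguous chunks, so if the data happens to be ordered by class, some partitions may contain only a few of the classes. The hypothetical helpers below (stratified_partitions and weighted_average are names made up for this illustration, not part of pymfe) use only NumPy to deal each class's rows evenly across the partitions and to weight each partition's value by its number of rows.

```python
import numpy as np

def stratified_partitions(y, n_splits, seed=0):
    """Return n_splits index arrays, each with roughly the same class mix as y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    parts = [[] for _ in range(n_splits)]
    for label in np.unique(y):
        # Shuffle the row indices of this class, then deal them out
        # across the partitions in near-equal chunks.
        idx = rng.permutation(np.flatnonzero(y == label))
        for part, chunk in zip(parts, np.array_split(idx, n_splits)):
            part.append(chunk)
    return [np.sort(np.concatenate(p)) for p in parts]

def weighted_average(values, sizes):
    """Average per-partition values, weighting each partition by its row count."""
    values = np.asarray(values, dtype=float)
    return np.average(values, axis=0, weights=sizes)

# Labels sorted by class: 50 rows of class 0, 30 of class 1, 20 of class 2.
y_demo = np.repeat([0, 1, 2], [50, 30, 20])
parts = stratified_partitions(y_demo, n_splits=5)
for idx in parts:
    # Every partition receives 10, 6 and 4 rows of classes 0, 1 and 2.
    print(len(idx), np.bincount(y_demo[idx], minlength=3))
```

With the helpers from the answer, the partitions could then be processed as results = [compute_meta_features(X[idx], y[idx]) for idx in parts] and combined with weighted_average([r[1] for r in results], [len(idx) for idx in parts]). Either way, the averaged value is an approximation: a measure computed on partitions is not guaranteed to match the value computed on the full dataset.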