
Is there a way to "recreate" data with scikit-learn?


My question is about scikit-learn in Python. Let's say that I have 3 features A, B and C, with A and B capable of predicting C with code like this:

exampleModel.fit(AandB, C)
C_pred = exampleModel.predict(AandB)

Is there a way for me to input some C value, and get the A and B values necessary to achieve that C value? Almost like inputting it in reverse. If there is a way, what is it called?


Solution

  • Yes, it's possible!

    But you need to be very clear about what you want: how are A, B and C related? Decide what the appropriate prediction model is.

    You also need to realize that there can't be a "perfect" reconstruction. A and B are, in general, a richer representation than C. Unless they are very restricted (for example, highly correlated with each other), some information is lost going from A and B to C, and that information cannot be recovered.
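    A minimal sketch of why this is one-to-many (using a hypothetical LogisticRegression stand-in for the model, and make_blobs toy data in place of your real A, B, C):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# toy data: two well-separated blobs; columns play the role of A and B
AB, C = make_blobs(n_samples=100, n_features=2, centers=2, random_state=0)

clf = LogisticRegression().fit(AB, C)

# 50 distinct (A, B) points all carry the same class label, so
# inverting C -> (A, B) cannot give back "the" original point;
# at best it gives one representative point per class.
points_of_class_0 = AB[C == 0]
print(points_of_class_0.shape[0], "distinct (A, B) points")
print("class labels among them:", np.unique(clf.predict(points_of_class_0)))
```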

    This will not work for all models, it works differently for every model, and it is not directly implemented in scikit-learn. In other words, you will have to do some manual work, and you need to know what you are doing. In particular, you need to understand the model you are working with. There is no drop-in solution.

    Let's assume that A and B are continuous features, and C is discrete 0 or 1. In this case an appropriate model would be a classifier. Let's further assume the A and B cluster nicely in different blobs for different C. In this case a linear classifier could work.

    I'll provide an example with Linear Discriminant Analysis. It works by linearly projecting the class centers into the most discriminative direction. In general, we would need to reverse this projection, but we are lucky in that the LDA exposes the original class centers. To get a representation of the original features from a given C we only need to find the correct class center.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    
    np.random.seed(7)
    
    
    def inverse_lda(lda, C):
        # map each requested class label to its index in lda.classes_
        c = np.flatnonzero(np.isin(lda.classes_, C))
        return lda.means_[c]
    
    
    AB, C = make_blobs(n_samples=333, n_features=2, centers=2)  # toy data
    A, B = AB.T
    
    plt.scatter(A, B, c=C, alpha=0.5)
    plt.xlabel('A')
    plt.ylabel('B')
    
    model = LDA(store_covariance=True).fit(AB, C)
    
    # reconstruct A and B for C=[0, 1]
    ABout = inverse_lda(model, C=[0, 1])
    
    plt.plot(ABout[0, 0], ABout[0, 1], 'o', label='C=0')
    plt.plot(ABout[1, 0], ABout[1, 1], 'o', label='C=1')
    plt.legend()
    
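    A quick sanity check (a self-contained sketch of the same setup): predicting on the reconstructed points should return exactly the C values we asked for, because each class mean lies on its own side of the LDA decision boundary.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import make_blobs

np.random.seed(7)

def inverse_lda(lda, C):
    # map each requested class label to its index in lda.classes_
    c = np.flatnonzero(np.isin(lda.classes_, C))
    return lda.means_[c]

AB, C = make_blobs(n_samples=333, n_features=2, centers=2)  # toy data
model = LDA(store_covariance=True).fit(AB, C)

# round trip: reconstruct (A, B) for each class, then classify them again
ABout = inverse_lda(model, C=[0, 1])
print(model.predict(ABout))
```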

    (plot: scatter of A vs. B colored by C, with the reconstructed class centers marked for C=0 and C=1)