Tags: python, python-3.x, random-forest

Loop to find a maximum R² in Python


I am trying to build a decision tree while optimizing which sampled values to use.

I am using a set of values like:

DATA1  DATA2  DATA3  VALUE
100    300    400    1.6
102    298    405    1.5
 88    275    369    1.9
120    324    417    0.9
103    297    404    1.7
110    310    423    1.1
105    297    401    0.7
099    309    397    1.6
...

My goal is to build a decision tree so that, from Data1, Data2 and Data3, I can predict the target value.

I have started by building a random forest classifier that gives me a coefficient of determination as a result. I attach it below:

from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Data
X = dfs.drop(columns='Dato a predecir')
y = dfs['Dato a predecir']

# 70% of the dataset for training and 30% for validation
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    random_state=0,
                                                    )

# Create the model to fit
bosque = RandomForestClassifier(n_estimators=71,
                                criterion="gini",
                                max_features="sqrt",
                                bootstrap=True,
                                max_samples=2/3,
                                oob_score=True
                                )

bosque.fit(X_train, y_train)
y_pred = bosque.predict(X_test)

r, p = stats.pearsonr(y_pred, y_test)
print(f"Pearson correlation: r={r}, p-value={p}")

Well, starting from this code, and thanks to "bootstrap=True", I get a new set of training data and a new coefficient of determination every time I run the code.

Can anyone help me loop this code to get the maximum value of the coefficient of determination and save the training data used so that I can make the optimal decision tree?

I have tried to write a for loop, but it doesn't really work. It is the following:

for i in range(10000):
    while r < 1:
        Arbol_decisión(X, y)
        r = r
    i = i + 1

The range I used does not cover all the data I have; I would need to try as many combinations of my data as possible. The letter "r" represents the value of the coefficient of determination. I am aware that the loop I have made is naive, but the truth is that I can't think of how to achieve it. Could you help me?

Many thanks for everything.

I am trying to run loops to obtain as many resampled matrices as possible and optimize my decision tree.


Solution

  • Firstly, you need to use a validation set AND a test set if you're going to approach it like this. Otherwise you will just have biased results and likely a model which is essentially overfit to the testing data.
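
    For example, one simple way to get all three sets with scikit-learn is to call train_test_split twice (the split fractions below are just illustrative):

    from sklearn.model_selection import train_test_split

    # First carve off a 20% test set, then split the remainder
    # 75/25 so the overall ratio is 60% train / 20% val / 20% test.
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp,
                                                      test_size=0.25,
                                                      random_state=0)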

    Secondly, if you are only randomly sampling your data (that's what bootstrap does), then all these results are telling you is that your dataset isn't great. Ideally a dataset should represent samples from the underlying distribution. Therefore, using more data is better as your model can more effectively learn the underlying distribution. In your case you are approaching the problem from the perspective that some of your data does NOT represent the underlying distribution (that's why you want to ignore it). If this is the case, then you should just clean your data properly in advance. If you can't figure out a way to identify these 'bad' data points, then I would not suggest messing around with this - since you would just be cherry-picking data and producing a bad model.
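
    If you do want to screen for suspect points up front, one minimal sketch (assuming the numeric DataFrame dfs from the question; the 3-sigma threshold is an arbitrary choice) is a simple z-score filter:

    import numpy as np

    # Standardise each column, then keep only rows whose values all
    # lie within 3 standard deviations of their column mean.
    z = (dfs - dfs.mean()) / dfs.std()
    dfs_clean = dfs[(np.abs(z) < 3).all(axis=1)]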

    I would generally suggest you take a pause from writing code and read up more on the theory behind decision trees, random forests and bootstrapping; otherwise you're likely to design poor ML experiments.

    If for some reason you think this is still a good approach (it's almost certainly not), then just do the bootstrapping yourself... Something like the code below (there is probably a more optimised solution).

    
    import numpy as np

    X = np.arange(1000)
    y = np.arange(1000) / 100
    
    # Selecting random train/val/test dataset
    # Define slices for 60% train, 20% val, 20% test
    train_size = slice(0, int(len(X) * 0.6))
    val_size = slice(int(len(X) * 0.6), int(len(X) * 0.8))
    test_size = slice(int(len(X) * 0.8), int(len(X) * 1))
    
    # Randomise the indices corresponding to X and y
    # (same size so only do once)
    rnd_idx = np.random.choice(np.arange(len(X)),
                               len(X),
                               replace=False)
    
    # Loop through the three dataset sizes and select randomised,
    # non-overlapping data for them.
    X_tr, X_va, X_te = [X[rnd_idx[sliced]] for sliced in [train_size, val_size, test_size]]
    y_tr, y_va, y_te = [y[rnd_idx[sliced]] for sliced in [train_size, val_size, test_size]]
    
    ###
    ### Define random forest here
    ###
    
    # Define the bootstrap size and method
    # Here we are sub-selecting 90% of the training data
    bootstrap_size = slice(0, int(len(X_tr) * 0.9))
    # And using replacement, so expect roughly a third of the draws
    # to be duplicates.
    replace = True
    
    # Define an acceptable threshold for performance
    acceptable_r = 0.9
    
    # Set initial value (non-physically low)
    r = -10
    # Do a while loop that repeats until the performance is appropriate
    while r < acceptable_r:
        # Create randomised indices corresponding to the training set
        rnd_idx2 = np.random.choice(np.arange(len(X_tr)),
                                    len(X_tr),
                                    replace=replace)
        # Subselect the bootstrapped training data
        X_tr_s = X_tr[rnd_idx2[bootstrap_size]]
        y_tr_s = y_tr[rnd_idx2[bootstrap_size]]
    
        ###
        ### Fit model here
        ###
    
        ###
        ### Apply to validation data here
        ###
    
        ###
        ### Calculate metric here
        ###
        
        r = r  # placeholder: assign the metric computed above
    
    ###
    ### Apply to testing data here
    ###
        
    

    Once the while loop exits, you can retrieve the corresponding training data, indices, model, etc.
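
    If you would rather use a fixed budget of resamples and keep the best one, a rough sketch of that bookkeeping (reusing X_tr, y_tr, X_va, y_va and bootstrap_size from above; RandomForestRegressor is only a stand-in, since the toy data here are continuous) could be:

    import numpy as np
    from scipy import stats
    from sklearn.ensemble import RandomForestRegressor

    best_r, best_idx, best_model = -np.inf, None, None

    for _ in range(50):  # fixed budget of bootstrap draws
        # Draw a bootstrap sample of the training indices
        rnd_idx2 = np.random.choice(np.arange(len(X_tr)),
                                    len(X_tr),
                                    replace=True)
        sel = rnd_idx2[bootstrap_size]

        # Fit on the bootstrapped subset (reshape, since X is 1-D here)
        model = RandomForestRegressor(n_estimators=71)
        model.fit(X_tr[sel].reshape(-1, 1), y_tr[sel])

        # Score on the held-out validation set
        r, _ = stats.pearsonr(model.predict(X_va.reshape(-1, 1)), y_va)

        # Keep the best score, its training indices and the fitted model
        if r > best_r:
            best_r, best_idx, best_model = r, sel, model

    print(f"Best validation r: {best_r:.3f}")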