Search code examples
pythonstatisticsrandom-forestp-value

Methods to test independence in a non parametric setting


I am using random forest as my regression model and have the following data where X is a high dimensional data set in a simplex form. I have tried using permutation and the following code:

 import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X = pd.read_csv("X_p.csv", delimiter=",", engine="c", index_col=0, low_memory=False)
Y = pd.read_csv("Y_p.csv", delimiter=",", engine="c", low_memory=False)

X = X.iloc[:, 1:]
Y = Y.iloc[:, 1:]

Y = Y['0'].values.ravel()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

rf_model = RandomForestRegressor(n_estimators=100,min_samples_split=5,max_depth=50, random_state=42)


original_mse = mean_squared_error(Y_train, rf_model.fit(X_train, Y_train).predict(X_train))

permuted_mse = []
for _ in range(10):  # Number of permutations
    permuted_Y = np.random.permutation(Y_train)
    permuted_mse.append(mean_squared_error(permuted_Y, rf_model.fit(X_train, permuted_Y).predict(X_train)))

# Step 6: Significance Testing based on MSE
p_value_mse = np.sum(permuted_mse > original_mse) / (len(permuted_mse)+1)

# Print results
print("Original MSE:", original_mse)
print("Permutation test p-value based on MSE:", p_value_mse)

output:

Original MSE: 2.2922379491349436
Permutation test p-value based on MSE: 0.9090909090909091

Process finished with exit code 0

Just running 10 permutations the code takes quite a long time and it always returns a very high p value (bassicly all permuted mse's are larger than the initial one) what is my mistake? I have tried to modify my random forest parameters but nothings seems to help.


Solution

  • I guess the idea behind using mse is to evaluate that our initial mse should be lower than a randomly permuted vision of the data

    That highlights the problem with the implementation of the permutation test: your alternative hypothesis is that the observed MSE is lower than if Y were independent of X. In other words, a lower MSE is "more extreme" than one would expect to get by chance under the null hypothesis. So the last step should be:

    p_value_mse = (np.sum(permuted_mse <= original_mse) + 1) / (len(permuted_mse)+1)
    

    This makes three changes to the above code:

    • Use < instead of > to correspond with the alternative hypothesis being tested.
    • Use <= instead of < because elements of the null distribution equal to your observed statistic would also be considered "extreme".
    • Add +1 in the numerator as well as the denominator (reference). These +1s are commonly interpreted as including the original observation in the permutation distribution, which gives an intuitive reason for including them in both the numerator and denominator. See 1 for more information, though: there is a case to be made for this being an approximation to a (long-run) "exact" p-value (despite the use of randomization).

    You might also read the SciPy tutorial about Resampling and Monte Carlo Methods if you're unfamiliar with the topic of permutation tests.