python cluster-analysis t-test

t-test for two clusters Python


I am running k-means clustering and want to test whether the resulting clusters are statistically different. In 3-level clustering, I test cluster 0 against cluster 1 and then against cluster 2. Then I test cluster 2 with cluster 3. I tried to apply a t-test to the clusters, as shown in the following code. The clusters have different lengths. I am confused about the logic: should I use p > 0.05 or p < 0.05, and where should True and False go?

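  from scipy.stats import ttest_ind

  def compare_2_groups(ar1, ar2):
      # Two-sample t-test: s is the t statistic, p is the two-sided p-value
      s, p = ttest_ind(ar1, ar2)
      #if p>0.05:
      if p < 0.05:
          return False  # the group means differ significantly
      else:
          return True   # no significant difference detected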

Solution

  • This procedure should work even if ar1 and ar2 have different lengths. The p-value indicates the strength of evidence AGAINST the null hypothesis that the two clusters have the same center: the smaller the p-value, the stronger the evidence. Two suggestions:

    • rename the function to reflect the nature of the test, e.g. "are_group_centers_equal"
    • with this name, return False if p < (your threshold) and True otherwise

    If you choose a name with the opposite meaning, such as "are_group_centers_different", reverse the logic of the threshold test and return True if p < (threshold).
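
The two naming options could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `alpha` and `equal_var` parameters, the synthetic 1-D data, and the variable names `cluster_a` and `cluster_b` are assumptions added here. `equal_var=True` matches the default of `ttest_ind` used in the question; passing `equal_var=False` would switch to Welch's t-test, which does not assume equal variances.

```python
import numpy as np
from scipy.stats import ttest_ind


def are_group_centers_different(ar1, ar2, alpha=0.05, equal_var=True):
    """Return True if the t-test rejects equal means at level alpha.

    alpha and equal_var are illustrative parameters, not part of the
    original question's function.
    """
    _, p = ttest_ind(ar1, ar2, equal_var=equal_var)
    # Small p = strong evidence against "same center", so "different" is True
    return bool(p < alpha)


def are_group_centers_equal(ar1, ar2, alpha=0.05, equal_var=True):
    """Opposite naming: True when the test does NOT reject equal means."""
    return not are_group_centers_different(ar1, ar2, alpha, equal_var)


# Synthetic 1-D groups of different lengths (assumed data for illustration)
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=1.0, size=40)
cluster_b = rng.normal(loc=5.0, scale=1.0, size=25)

print(are_group_centers_different(cluster_a, cluster_b))
print(are_group_centers_equal(cluster_a, cluster_a))
```

With means this far apart, the first call should report a difference, and comparing a group with itself gives a t statistic of 0, so the second call should report the centers as equal. Note that failing to reject the null hypothesis is not proof that the centers are identical, so "equal" here only means "no significant difference detected".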