T-test on the means pandas


I'm working with the MovieLens dataset and I would like to run a t-test on the mean rating values of the male and female users.

import pandas as pd
from scipy.stats import ttest_ind

users_table_names= ['user_id','age','gender','occupation','zip_code']
users= pd.read_csv('ml-100k/u.user', sep='|', names= users_table_names)
ratings_table_names= ['user_id', 'item_id','rating','timestamp']
ratings= pd.read_csv('ml-100k/u.data', sep='\t', names=ratings_table_names)
rating_df= pd.merge(users, ratings)

males = rating_df[rating_df['gender']=='M']
females = rating_df[rating_df['gender']=='F']

ttest_ind(males.rating, females.rating)

And I get the following result:

Ttest_indResult(statistic=-0.27246234775012407, pvalue=0.7852671011802962)

Is this the correct way to do this operation? The results seem a bit odd.

Thank you in advance!


Solution

  • With your code you are performing a two-sided t-test under the assumption that the two populations have identical variances, since you haven't specified the parameter equal_var and it defaults to True in SciPy's ttest_ind().

    So you can state your statistical test as:

    • Null hypothesis (H0): there is no difference between the values recorded for males and females, or in other words, the means are equal (µMale == µFemale).
    • Alternative hypothesis (H1): there is a difference between the values recorded for males and females, or in other words, the means are not equal (covering both µMale > µFemale and µMale < µFemale, or simply µMale != µFemale).

    The significance level is a threshold you choose before running the test, such as 0.05. If you had obtained a p-value smaller than your significance level, you could reject the null hypothesis (H0) in favor of the alternative hypothesis (H1).

    In your results, the p-value is ~0.79, so you can't reject H0. This does not prove that the means are equal; it only means that the data show no statistically significant difference between them.
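
    The decision rule above can be sketched in a few lines. This is only an illustration: the helper name decide and the default alpha of 0.05 are assumptions, not part of SciPy, and the p-value passed in is the one reported in the question.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha.

    Hypothetical helper for illustration only; it is not a SciPy function.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # A p-value below alpha is evidence against H0; otherwise we only
    # fail to reject H0, which is not the same as proving H0.
    return "reject H0" if p_value < alpha else "fail to reject H0"


# p-value taken from the ttest_ind output in the question
print(decide(0.7852671011802962))  # prints "fail to reject H0"
```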

    Considering the standard deviations of the samples below, you could also run the test with equal_var=False (Welch's t-test), which does not assume equal variances:

    >> males.rating.std()
    1.1095557786889139
    >> females.rating.std()
    1.1709514829100405
    
    >> ttest_ind(males.rating, females.rating, equal_var = False)
    Ttest_indResult(statistic=-0.2654398046364026, pvalue=0.7906719538136853)
    

    This also fails to reject the null hypothesis (H0).
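
    To see what equal_var changes, here is a minimal sketch of the two test statistics written with the standard library only. The function names pooled_t and welch_t and the two small samples are made up for illustration; they are not the MovieLens ratings, and this is not SciPy's actual implementation.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """t statistic assuming equal population variances (equal_var=True)."""
    n1, n2 = len(a), len(b)
    # Pooled sample variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def welch_t(a, b):
    """t statistic without the equal-variance assumption (equal_var=False)."""
    n1, n2 = len(a), len(b)
    # Each sample contributes its own variance to the standard error
    return (mean(a) - mean(b)) / sqrt(variance(a) / n1 + variance(b) / n2)


# Hypothetical samples with different sizes and spreads
group_a = [4, 5, 3, 4, 5, 2, 4]
group_b = [3, 4, 5, 1, 2, 5, 4, 3, 1, 5]

print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

    The two statistics coincide when both samples have the same size and drift apart as the sizes and variances become more unequal, which is why the two results above are so close for your data.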

    If you use statsmodels' ttest_ind(), you also get the degrees of freedom used in the t-test:

    >> import statsmodels.api as sm
    >> sm.stats.ttest_ind(males.rating, females.rating, alternative='two-sided', usevar='unequal')
    (-0.2654398046364028, 0.790671953813685, 42815.86745494558)
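
    The non-integer degrees of freedom in the third value come from the Welch-Satterthwaite approximation used for the unequal-variance test. A rough sketch of that formula follows; the function name welch_df and the sample data are assumptions for illustration and are not the ratings from the question.

```python
from statistics import variance


def welch_df(a, b):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    n1, n2 = len(a), len(b)
    # Squared standard error contributed by each sample
    v1 = variance(a) / n1
    v2 = variance(b) / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Hypothetical samples, chosen only to exercise the formula
group_a = [1, 2, 3, 4, 5]
group_b = [2, 2, 3, 3, 3, 4, 4, 10]

# The result lies between min(n1, n2) - 1 and n1 + n2 - 2
print(welch_df(group_a, group_b))
```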
    

    What exactly did you find odd about your results?