I'm working with the MovieLens dataset and I would like to run a t-test on the mean rating values of male and female users.
import pandas as pd
from scipy.stats import ttest_ind
users_table_names= ['user_id','age','gender','occupation','zip_code']
users= pd.read_csv('ml-100k/u.user', sep='|', names= users_table_names)
ratings_table_names= ['user_id', 'item_id','rating','timestamp']
ratings= pd.read_csv('ml-100k/u.data', sep='\t', names=ratings_table_names)
rating_df= pd.merge(users, ratings)
males = rating_df[rating_df['gender']=='M']
females = rating_df[rating_df['gender']=='F']
ttest_ind(males.rating, females.rating)
And I get the following result:
Ttest_indResult(statistic=-0.27246234775012407, pvalue=0.7852671011802962)
Is this the correct way to do this operation? The results seem a bit odd.
Thank you in advance!
Your code runs a two-sided t-test under the assumption that the two populations have identical variances, because you haven't specified the equal_var parameter and it defaults to True in scipy's ttest_ind().
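So you can represent your statistical test as:
H0: the mean rating of male users equals the mean rating of female users
H1: the mean rating of male users differs from the mean rating of female users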
The significance level is a threshold you choose for the test in advance, such as 0.05. If you had obtained a p-value smaller than your significance level, you could reject the null hypothesis (H0) in favor of the alternative hypothesis (H1).
In your results, the p-value is ~0.79, so you cannot reject H0. In other words, the data give no evidence that the two means differ.
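The decision rule described above can be sketched as follows. This is only an illustration: the two samples are synthetic stand-ins, not the MovieLens ratings, and the names alpha, group_a and group_b are assumptions made for the example.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two rating samples (NOT the MovieLens data).
rng = np.random.default_rng(0)
group_a = rng.integers(1, 6, size=5000)  # integer ratings from 1 to 5
group_b = rng.integers(1, 6, size=2000)

alpha = 0.05  # significance level, chosen before looking at the data

# Two-sided test with equal variances assumed (scipy's default equal_var=True).
result = ttest_ind(group_a, group_b)
statistic, pvalue = result.statistic, result.pvalue

# Compare the p-value with the significance level to reach a decision.
if pvalue < alpha:
    decision = "reject H0: the means differ"
else:
    decision = "fail to reject H0: no evidence that the means differ"

print(f"t = {statistic:.4f}, p = {pvalue:.4f} -> {decision}")
```

Note that "fail to reject H0" is not the same as proving the means are equal; it only says the samples do not provide evidence of a difference at the chosen level.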
Since the sample standard deviations below differ slightly, you could also run the test with equal_var = False (Welch's t-test), which does not assume equal variances:
>> males.rating.std()
1.1095557786889139
>> females.rating.std()
1.1709514829100405
>> ttest_ind(males.rating, females.rating, equal_var = False)
Ttest_indResult(statistic=-0.2654398046364026, pvalue=0.7906719538136853)
This also fails to reject the null hypothesis (H0).
If you use the statsmodels ttest_ind(), you also get the degrees of freedom used in the t-test:
>> import statsmodels.api as sm
>> sm.stats.ttest_ind(males.rating, females.rating, alternative='two-sided', usevar='unequal')
(-0.2654398046364028, 0.790671953813685, 42815.86745494558)
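To show where a non-integer degrees-of-freedom value like this comes from, here is a minimal sketch that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with scipy. The samples a and b are synthetic (an assumption for the example, not your data), so the numbers will not match the output above.

```python
import numpy as np
from scipy.stats import t, ttest_ind

# Synthetic samples with slightly different spreads (NOT the MovieLens data).
rng = np.random.default_rng(1)
a = rng.normal(3.5, 1.1, size=3000)
b = rng.normal(3.5, 1.2, size=1000)

# Variance of each sample mean, using the unbiased (ddof=1) sample variance.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic: difference in means over the combined standard error.
t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution with `dof` degrees of freedom.
p_value = 2 * t.sf(abs(t_stat), dof)

# Reference values from scipy's Welch test.
reference = ttest_ind(a, b, equal_var=False)
print(t_stat, p_value, dof)
print(reference.statistic, reference.pvalue)
```

The degrees of freedom always fall between the smaller sample size minus one and the pooled value n1 + n2 - 2, which is why the value is generally not an integer.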
What exactly did you find odd about your results?