Different t-test pvalues between R and Python


I'm new to Python and am trying to learn more about propensity score matching. I found a great tutorial from Stanford.edu that covers this (since this is my first post, Stack Overflow won't let me post two links, but you can google "Stanford propensity score matching"). My goal was to recreate all of it in Python and understand what's happening.

My issue is that when I get to section 1.2, "Difference-in-means: pre-treatment covariates", and start running t-tests, I don't understand why the p-values are so different between R and Python for the same test on the same data.

R code: with(ecls, t.test(race_white ~ catholic, var.equal=FALSE))

R output:

Welch Two Sample t-test

data:  race_white by catholic
t = -13.453, df = 2143.3, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1936817 -0.1444003
sample estimates:
mean in group 0 mean in group 1 
      0.5561246       0.7251656

When I perform the same thing in Python, my t-statistic and degrees of freedom are identical, but my p-values are way off.

Python code:

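# Assumes `sms` is statsmodels.stats.api and `dat` is a pandas DataFrame loaded from ecls.csv
import statsmodels.stats.api as sms

cath = dat[dat['catholic'] == 1]['race_white']
noncath = dat[dat['catholic'] == 0]['race_white']
# Welch's t-test (unequal variances); returns (t-statistic, p-value, df)
fina = sms.ttest_ind(noncath, cath, alternative='two-sided', usevar='unequal')
print(fina)
# %.3f rounds each value, including the p-value, to three decimal places
print("The t-statistic is %.3f the p-value is %.3f and the df is %.3f" % fina)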

Python output:

(-13.45342570302274, 1.1413329198468439e-39, 2143.2902027156415)
The t-statistic is -13.453 the p-value is 0.000 and the df is 2143.290

I'm using the exact same dataset and just can't figure out why the two are different. I saw another SO topic that was similar, but the conclusion there was that the sample sizes were different. Here the dataset is the same, so the size isn't different.

The data file (ecls.csv), which is used for both Python and R, can be found here. Any help as to why the p-values are different for this t-test is greatly appreciated.


Solution

  • R does not print p-values below 2.2e-16, but they are calculated and stored. Try this for your R code:

    with(ecls, t.test(race_white ~ catholic, var.equal=FALSE))$p.value
    [1] 1.141333e-39
    

    The value is effectively zero, which is why you see 0.000 when you print it to 3 decimal places in Python. Printing the unmodified p-value in Python (without %.3f), which you in fact already did with print(fina), shows 1.1413329198468439e-39, the same value R has stored.
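
To make the display issue concrete, here is a minimal sketch using only the p-value from the Python output in the question. It compares fixed-point and scientific formatting, and roughly imitates how R shows very small p-values. The helper `r_style_pvalue` is a hypothetical name for illustration and is not R's actual implementation; the assumption is that R's cutoff of 2.2e-16 corresponds to double-precision machine epsilon.

```python
import sys

# p-value copied from the Python output in the question
p_value = 1.1413329198468439e-39

# Fixed-point formatting with 3 decimals rounds the tiny value to zero
fixed = "%.3f" % p_value

# Scientific notation keeps the magnitude visible
sci = "%.3e" % p_value


def r_style_pvalue(p, eps=sys.float_info.epsilon):
    """Hypothetical helper: rough imitation of R's printed p-value.

    Assumption: values below machine epsilon (about 2.2e-16) are shown
    as "< 2.2e-16" rather than printed in full.
    """
    if p < eps:
        return "< %.1e" % eps
    return "= %.4g" % p


print(fixed)                    # 0.000
print(sci)                      # 1.141e-39
print(r_style_pvalue(p_value))  # < 2.2e-16
```

All three strings describe the same number, so the two programs agree; only the printing differs. In the original code, swapping the middle `%.3f` for `%.3e` in the format string should be enough to see the p-value directly.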