The one-way ANOVA function I'm using keeps spitting out F values that don't make sense


I'm working on a project for college and it's kicking my ass.

I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions

I'm trying to use an ANOVA to see if there's a statistically significant difference in time taken to summit between the seasons.

The F value I'm getting back doesn't seem to make any sense. Any suggestions?

#import pandas
import pandas as pd

#import expeditions as csv file
exp = pd.read_csv('C:\\filepath\\expeditions.csv')

#extract only the data relating to everest
exp= exp[exp['peak_name'] == 'Everest']

#create a subset of the data containing only the columns of interest
exp_peaks = exp[['peak_name', 'member_deaths', 'termination_reason', 'hired_staff_deaths', 'year', 'season', 'basecamp_date', 'highpoint_date']]

#extract successful attempts
exp_peaks = exp_peaks[(exp_peaks['termination_reason'] == 'Success (main peak)')]

#drop missing values from basecamp_date & highpoint_date
exp_peaks = exp_peaks.dropna(subset=['basecamp_date', 'highpoint_date'])

#convert basecamp date to datetime
exp_peaks['basecamp_date'] = pd.to_datetime(exp_peaks['basecamp_date'])
#convert highpoint date to datetime
exp_peaks['highpoint_date'] = pd.to_datetime(exp_peaks['highpoint_date'])

#calculate the time taken from basecamp to the high point (a timedelta)
exp_peaks['time_taken'] = exp_peaks['highpoint_date'] - exp_peaks['basecamp_date']

#convert seasons from strings to ints
exp_peaks['season'] = exp_peaks['season'].replace('Spring', 1)
exp_peaks['season'] = exp_peaks['season'].replace('Autumn', 3)
exp_peaks['season'] = exp_peaks['season'].replace('Winter', 4)
#remove summer and unknown
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Summer')]
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Unknown')]

#subset the data according to the season
exp_peaks_spring = exp_peaks[exp_peaks['season'] == 1]
exp_peaks_autumn = exp_peaks[exp_peaks['season'] == 3]
exp_peaks_winter = exp_peaks[exp_peaks['season'] == 4]

#calculate the average time taken in spring
exp_peaks_spring_duration = exp_peaks_spring['time_taken']
mean_exp_peaks_spring_duration = exp_peaks_spring_duration.mean()

#calculate the average time taken in autumn
exp_peaks_autumn_duration = exp_peaks_autumn['time_taken']
mean_exp_peaks_autumn_duration = exp_peaks_autumn_duration.mean()

#calculate the average time taken in winter
exp_peaks_winter_duration = exp_peaks_winter['time_taken']
mean_exp_peaks_winter_duration = exp_peaks_winter_duration.mean()

# Turn the season column into a categorical
exp_peaks['season'] = exp_peaks['season'].astype('category')
exp_peaks['season'].dtypes


from scipy.stats import f_oneway

# One-way ANOVA
f_value, p_value = f_oneway(exp_peaks['season'], exp_peaks['time_taken'])
print("F-score: " + str(f_value))
print("p value: " + str(p_value))

Solution

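  • f_oneway expects each group's sample of the continuous variable as a separate argument; it does not take a categorical grouping variable. In your call, the season column is treated as one sample and time_taken as the other, so the test compares season codes against durations, which is why the F value looks meaningless. You can build the per-season samples using groupby: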

    f_oneway(*(group for _, group in exp_peaks.groupby("season")["time_taken"]))
    

    Or equivalently, since you have already created series for each season:

    f_oneway(exp_peaks_spring_duration, exp_peaks_autumn_duration, exp_peaks_winter_duration)
    

    I would have thought there would be an easier way to perform an ANOVA in this common case, but I can't find one.
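
    For reference, here is a minimal self-contained sketch of the groupby approach on made-up data. The toy values and the `days_taken` column name are my own assumptions, not from the Kaggle file. Since `time_taken` is a timedelta, the sketch converts it to whole days first, on the assumption that it is safer to hand f_oneway plain numbers:

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy stand-in for exp_peaks; the durations are invented for illustration only.
toy = pd.DataFrame({
    "season": ["Spring"] * 3 + ["Autumn"] * 3 + ["Winter"] * 3,
    "time_taken": pd.to_timedelta([30, 32, 34, 40, 42, 44, 50, 52, 54], unit="D"),
})

# Convert the timedelta column to a plain numeric column (whole days).
# "days_taken" is a hypothetical column name introduced for this sketch.
toy["days_taken"] = toy["time_taken"].dt.days

# One Series of durations per season; each becomes a separate argument.
samples = [group for _, group in toy.groupby("season")["days_taken"]]

f_value, p_value = f_oneway(*samples)
print("F-score: " + str(f_value))
print("p value: " + str(p_value))
```

    With your real data the same pattern should apply to exp_peaks once the durations are numeric. Seasons with very few expeditions will make the result less reliable, so it is worth checking the group sizes with exp_peaks['season'].value_counts() first.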