python pandas categorical-data anova

How to convert str variables into distinct categories in a dataframe?


I'm trying to convert some data so that I can analyse it, and as I'm not very experienced I keep running into problems. I've already received some great advice from the community, but once again I'm stumped.

I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions.

@LancelotduLac was kind enough to fix the first part of the problem for me by showing me how to convert the various termination reasons into a binary variable:

from pandas import read_csv

RE = '^Success.*$'
NRE = '^((?!Success).)*$'
TR = 'termination_reason'
BD = 'basecamp_date'
SE = 'season'

data = read_csv('C:\\Users\\joepf\\OneDrive\\Desktop\\Data analytics course\\Programming1\\CA2\\data\\expeditions.csv')

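# Take an explicit copy so the assignments below do not trigger SettingWithCopyWarning
exp_win_v_fail = data[[TR, BD, SE]].copy()

# Replace every non-success reason with 0, then every success reason with 1
for v, re_ in enumerate((NRE, RE)):
    exp_win_v_fail[TR] = exp_win_v_fail[TR].replace(to_replace=re_, value=v, regex=True)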
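For comparison, the same success/failure flag can probably be built without the two regular expressions. The following is only a minimal sketch on made-up values (the `reasons` series and its labels are hypothetical stand-ins for the `termination_reason` column), and it assumes every success label starts with the word "Success":

```python
import pandas as pd

# Hypothetical stand-in for the 'termination_reason' column
reasons = pd.Series(
    ["Success (main peak)", "Bad weather", "Success (subpeak)", "Accident"]
)

# True where the reason starts with "Success"; casting to int gives 1/0
success_flag = reasons.str.startswith("Success").astype(int)
print(success_flag.tolist())
```

If the real column contains missing values, `str.startswith` also accepts `na=False` so that they count as failures.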

Then I tried to convert the seasons into a categorical variable in order to carry out an ANOVA, which has not been going so well:

# Turn the season column into a categorical
exp_win_v_fail['season'] = exp_win_v_fail['season'].astype('category')
exp_win_v_fail['season'].dtypes


from scipy.stats import f_oneway

# One-way ANOVA
f_value, p_value = f_oneway(exp_win_v_fail[SE], exp_win_v_fail[TR])
print("F-score: " + str(f_value))
print("p value: " + str(p_value))

I assumed that I would not need to convert the seasons from str if I made the column categorical. However, the console throws the error below, which is making me second-guess that assumption:

 File "C:\Users\joepf\anaconda3\lib\site-packages\numpy\core\_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)

ValueError: could not convert string to float: 'Spring'

Any suggestions would be much appreciated.
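A likely reason the categorical dtype does not help: when a categorical column is converted to a NumPy array, the result holds the category labels, so the strings are still there when `f_oneway` tries to build a float array. A minimal sketch on a made-up `season` series (not the real data) that shows this, together with the integer codes pandas stores behind the labels:

```python
import numpy as np
import pandas as pd

# Made-up season values, converted to a categorical as in the question
season = pd.Series(["Spring", "Autumn", "Winter", "Spring"]).astype("category")

# Converting to a NumPy array yields the labels, not numbers
labels = np.asarray(season)
print(labels)

# The integer codes behind the labels; categories are sorted alphabetically
# by default, so Autumn -> 0, Spring -> 1, Winter -> 2 here
codes = season.cat.codes
print(codes.tolist())
```

Note that the codes follow the sorted order of the category labels, not the order of the seasons in the year.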


Solution

  • I figured out how to make it run by converting the seasons from strings to ints:

    # Drop rows with an unknown season, then convert the remaining seasons from strings to ints
    exp_win_v_fail = exp_win_v_fail[exp_win_v_fail['season'] != 'Unknown'].copy()
    season_codes = {'Spring': 1, 'Summer': 2, 'Autumn': 3, 'Winter': 4}
    exp_win_v_fail['season'] = exp_win_v_fail['season'].replace(season_codes)
    
    # Turn the season column into a categorical
    exp_win_v_fail['season'] = exp_win_v_fail['season'].astype('category')
    exp_win_v_fail['season'].dtypes
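One caveat worth noting: `f_oneway` treats each argument as a separate sample group, so passing the season codes and the success flag as two arguments compares the mean of one column against the mean of the other. To test whether the success rate differs between seasons, the usual pattern is to split the outcome by season and pass one array per group. A minimal sketch on made-up data (the `df` frame below is hypothetical, not the Kaggle file):

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical data: 1 = success, 0 = failure
df = pd.DataFrame({
    "season": ["Spring", "Spring", "Spring",
               "Autumn", "Autumn", "Autumn",
               "Winter", "Winter", "Winter"],
    "termination_reason": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# One array of outcomes per season
groups = [g["termination_reason"].to_numpy() for _, g in df.groupby("season")]

# One-way ANOVA across the per-season groups
f_value, p_value = f_oneway(*groups)
print("F-score: " + str(f_value))
print("p value: " + str(p_value))
```

With this grouping the season labels never reach `f_oneway`, so they can stay as strings and do not need to be converted to ints.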