I have a set of values in my dataframe that are objects written as "100%", "75%", etc. I'm converting these to integers (100, 75, etc.).
This is the function I have:
def convert_object_to_int(column):
    column = column.astype(str)
    column = column.str.rstrip('%')
    column = pd.to_numeric(column, errors='coerce')
    column = column.fillna(column.median())
    return column.astype(int)
After calling the function with this:
a1data.loc[:, 'Total(%)'] = convert_object_to_int(a1data['Total(%)'])
My Total(%) column still shows up as object when I check a1data.dtypes.
The numbers have changed, and I am able to use them in visualisations and so on. However, I can't compute basic descriptive statistics on the data, as I get the categorical summary instead of the numeric one.
I'm very much a beginner so any pointers would be greatly appreciated.
I've tried converting to floats instead as I read there used to be some issues with int64. A lot of the lines in the function kinda feel unnecessary, but the numbers weren't changing properly until all those lines were there. The numbers are now showing what I want them to but they still count as objects for descriptive statistics and other functions.
This is because assigning with a1data.loc[:, 'Total(%)'] = ... writes the values into the existing Series in place, which keeps its original object dtype. Instead, overwrite the column with a new Series:
a1data['Total(%)'] = convert_object_to_int(a1data['Total(%)'])
print(a1data.dtypes)
# Total(%) int64
# dtype: object
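To see the difference between the two assignments side by side, here is a minimal sketch. The sample data is made up for illustration (the real a1data is not shown in the question), and the dtype printed for the .loc version depends on your pandas version: recent releases set values in place and typically keep object, while some older releases replaced the column.

```python
import pandas as pd

def convert_object_to_int(column):
    column = pd.to_numeric(column.astype(str).str.rstrip('%'), errors='coerce')
    return column.fillna(column.median()).astype(int)

# Hypothetical sample data standing in for the real a1data
raw = pd.Series(['100%', '75%', 'n/a', '50%'], dtype=object)
a1data = pd.DataFrame({'Total(%)': raw})

# In-place write into the existing column (result is version-dependent)
in_place = a1data.copy()
in_place.loc[:, 'Total(%)'] = convert_object_to_int(in_place['Total(%)'])
print(in_place['Total(%)'].dtype)

# Replace the column with the new Series, which brings its own dtype
a1data['Total(%)'] = convert_object_to_int(a1data['Total(%)'])
print(a1data['Total(%)'].dtype)

# With an integer dtype, describe() reports numeric statistics
print(a1data['Total(%)'].describe())
```

Once the column has an integer dtype, describe() reports mean, std, min, and quartiles rather than count/unique/top/freq.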
Also note that you do not need to reassign every intermediate result in your function. You could simplify it to:
def convert_object_to_int(column):
    column = pd.to_numeric(column.astype(str)
                                 .str.rstrip('%'),
                           errors='coerce')
    return column.fillna(column.median()).astype(int)
Or without any variable:
def convert_object_to_int(column):
    return (pd.to_numeric(column.astype(str)
                                .str.rstrip('%'),
                          errors='coerce')
              .pipe(lambda x: x.fillna(x.median()))
              .astype(int)
           )
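As a quick sanity check, the chained version can be run on a small Series. The sample values below are invented for illustration. The unparseable entry becomes NaN and is then filled with the median of the remaining values.

```python
import pandas as pd

def convert_object_to_int(column):
    return (pd.to_numeric(column.astype(str).str.rstrip('%'),
                          errors='coerce')
              .pipe(lambda x: x.fillna(x.median()))
              .astype(int)
           )

# Hypothetical input: three percentages and one value that cannot be parsed
sample = pd.Series(['100%', '75%', 'n/a', '50%'], dtype=object)
result = convert_object_to_int(sample)
print(result.tolist())
```

Keep in mind that astype(int) truncates, so a non-integer median such as 87.5 would be stored as 87.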