I am having a hard time understanding what is going on with my sql to pandas data frame datatypes:
I don't understand why my dataset is coming back with these count and sum fields as object as I cannot use operations such as df.total_deductions.max(). I am failing to understand what could be causing this or how to fix it.
query = """ SELECT
date,
user_id,
sum(income) total_spend,
sum(deductions) total_deductions ,
count(distinct stores) number_stores
FROM db_table GROUP BY user_id """
df = pd.read_sql(query, jdbc_connection)
df.dtypes:
date: object
user_id: object
total_spend: float
total_deductions: object
number_stores: object
I looked at the data. I can't seem anything to indicate that would cause these counts or sums to be an object rather than a numeric value.
I tried using pd.to_numeric( each_of_my_columns, error ='coerce') but this coerce options forces them to be "NaN".
Can someone hypothesize what could be going on here or how to solve this, since I assume I am doing something that should be obvious incorrectly?
There can be problem values have start or end whitespaces, which can be removed by strip
:
df['number_stores']=pd.to_numeric(df['number_stores'].astype(str).str.strip(),error='coerce')
You can check it by convert to list
:
print (df['number_stores'].tolist()[:20])