Search code examples
pythonpython-2.7pandastype-conversionpandasql

pandas Numerical columns being treated as object and won't coerce


I am having a hard time understanding what is going on with my sql to pandas data frame datatypes:

  • User_ID is expected to be an 'object'.. which is fine.
  • DATE is in the format of 201612, 201701, 201702, etc. (year_number+month_number)
  • INCOME is all numerical values such as 57.25, 50, 100.10, etc
  • DEDUCTIONS is also numerical
  • COUNT of STORES would naturally be an integer...

I don't understand why my dataset is coming back with these count and sum fields as object as I cannot use operations such as df.total_deductions.max(). I am failing to understand what could be causing this or how to fix it.

query = """ SELECT  
  date, 
   user_id,  
  sum(income) total_spend, 
 sum(deductions) total_deductions ,
  count(distinct stores) number_stores 
   FROM  db_table GROUP BY user_id """

df = pd.read_sql(query, jdbc_connection)

df.dtypes: 
date:  object 
user_id:  object 
total_spend:   float 
total_deductions:  object 
number_stores: object

I looked at the data. I can't seem anything to indicate that would cause these counts or sums to be an object rather than a numeric value.
I tried using pd.to_numeric( each_of_my_columns, error ='coerce') but this coerce options forces them to be "NaN".

Can someone hypothesize what could be going on here or how to solve this, since I assume I am doing something that should be obvious incorrectly?


Solution

  • There can be problem values have start or end whitespaces, which can be removed by strip:

    df['number_stores']=pd.to_numeric(df['number_stores'].astype(str).str.strip(),error='coerce')
    

    You can check it by convert to list:

    print (df['number_stores'].tolist()[:20])