python pandas correlation valueerror

.corr results in ValueError: could not convert string to float


I'm getting a strange error while working through this exercise on the pandas corr() method:

https://www.geeksforgeeks.org/python-pandas-dataframe-corr/

Specifically, the error appears when I run: df.corr(method='pearson')

The error message gives me no clue. I thought corr() was supposed to ignore strings and empty values automatically.

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    df.corr(method='pearson')
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 10059, in corr
    mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 1838, in to_numpy
    result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\managers.py", line 1732, in as_array
    arr = self._interleave(dtype=dtype, na_value=na_value)
  File "C:\Users\d.o\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\managers.py", line 1794, in _interleave
    result[rl.indexer] = arr
ValueError: could not convert string to float: 'Avery Bradley'

Solution

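  • When I try to replicate this, corr() works but emits a FutureWarning (shown below) saying that the automatic dropping of non-numeric columns will be removed in a future version. Perhaps the future has arrived? In pandas 2.0 and later, numeric_only defaults to False, so string columns are no longer dropped silently and corr() raises this ValueError instead.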

    I've got pandas version 1.5.3.

    You may need to specify which columns to use, which is a better approach anyway than relying on pandas to choose them for you. You can do that by indexing the DataFrame with a list of the columns of interest (shown below).

    In [1]: import pandas as pd
    
    In [2]: data = {'name': ['bob', 'cindy', 'tom'],
       ...:         'x'   : [ 1,     2,      3   ],
       ...:         'y'   : [ 6.5,   8.9,    12.0]}
    
    In [3]: df = pd.DataFrame(data)
    
    In [4]: df
    Out[4]: 
        name  x     y
    0    bob  1   6.5
    1  cindy  2   8.9
    2    tom  3  12.0
    
    In [5]: df.describe()
    Out[5]: 
             x          y
    count  3.0   3.000000
    mean   2.0   9.133333
    std    1.0   2.757414
    min    1.0   6.500000
    25%    1.5   7.700000
    50%    2.0   8.900000
    75%    2.5  10.450000
    max    3.0  12.000000
    
    In [6]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column  Non-Null Count  Dtype  
    ---  ------  --------------  -----  
     0   name    3 non-null      object 
     1   x       3 non-null      int64  
     2   y       3 non-null      float64
    dtypes: float64(1), int64(1), object(1)
    memory usage: 200.0+ bytes
    
    In [7]: df.corr(method='pearson')
    <ipython-input-7-432dd9d4238b>:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
      df.corr(method='pearson')
    Out[7]: 
              x         y
    x  1.000000  0.997311
    y  0.997311  1.000000
    
    In [8]: df[['x', 'y']].corr(method='pearson')
    Out[8]: 
              x         y
    x  1.000000  0.997311
    y  0.997311  1.000000
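
If you would rather not list the columns by hand, two other options should give the same result. This is a minimal sketch reusing the toy DataFrame from above, and it assumes pandas 1.5 or later, where corr() accepts the numeric_only parameter named in the warning. The variable names corr_flag and corr_dtypes are only for illustration.

```python
import pandas as pd

# Same toy data as in the session above.
df = pd.DataFrame({
    "name": ["bob", "cindy", "tom"],
    "x": [1, 2, 3],
    "y": [6.5, 8.9, 12.0],
})

# Option 1: tell corr() explicitly to skip non-numeric columns.
# (Assumes pandas >= 1.5, where corr() has the numeric_only parameter.)
corr_flag = df.corr(method="pearson", numeric_only=True)

# Option 2: select the numeric columns by dtype before calling corr().
corr_dtypes = df.select_dtypes(include="number").corr(method="pearson")

print(corr_flag)
print(corr_dtypes)
```

Either form avoids the ValueError because the string column never reaches the float conversion. Passing numeric_only=True also silences the FutureWarning on 1.5.x.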