python, pandas, enums, sklearn-pandas

Python Dataframe column name issues with StrEnum and sklearn regression


I have been seeing many warnings from sklearn, for example:

FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['MyNewNames']. An error will be raised in 1.2.

while using StrEnum for the names of my features in regression analyses. I wrote the following code to illustrate the case:

from enum import auto
from strenum import StrEnum
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression


class MyNewNames(StrEnum):
    MyCrim = auto()
    MyZN = auto()
    MyIndus = auto()


class ComputeRegression():

    def SetUP(self, df: pd.DataFrame):
        variables = [[MyNewNames.MyCrim, MyNewNames.MyZN], [MyNewNames.MyCrim, MyNewNames.MyIndus]]

        for value in variables:
            dX = df[value]
            dY = df["AGE"]
            self.ComputeRegression(dX, dY)

    def ComputeRegression(self, dX, dY):
        model = linear_model.LinearRegression()
        model.fit(X=dX, y=dY)
        predicted = model.predict(dX)  # I see warnings when calling this line
        print(predicted)

# load_boston was removed in scikit-learn 1.2, so this needs an older release
boston = datasets.load_boston()

df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
df[MyNewNames.MyCrim] = df["CRIM"]
df[MyNewNames.MyZN] = df["ZN"]
df[MyNewNames.MyIndus] = df["INDUS"]

cr = ComputeRegression()
cr.SetUP(df)

It seems the issue arises when I create a new column in the dataframe using the StrEnum, since when I change that part of the code to:

df[MyNewNames.MyCrim.value] = df["CRIM"]
df[MyNewNames.MyZN.value] = df["ZN"]
df[MyNewNames.MyIndus.value] = df["INDUS"]

the warnings disappear. Can anyone explain why creating a new dataframe column with an enum member such as MyNewNames.Something causes issues with the naming of the df column, while accessing a column like df[MyNewNames.MyIndus] is not a problem?

Thanks!


Solution

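  • The likely cause is that sklearn checks the exact type of each column name rather than using isinstance. That is, it does something like:

    if type(column_name) is str

    instead of

    if isinstance(column_name, str)

    A StrEnum member is an instance of str, but its exact type is MyNewNames, which matches the type named in the warning. Accessing df[MyNewNames.MyIndus] still works because the member hashes and compares equal to its string value, so pandas finds the column. A column created with the enum member keeps the enum as its label, whereas using .value stores a plain str.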
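
The difference between the two checks can be seen without sklearn or pandas. The sketch below is only an illustration: it uses a standard-library str/Enum mix-in as a stand-in for the strenum package's StrEnum, with the values written out by hand, and exact_type_names is a hypothetical helper, not sklearn's actual code.

```python
from enum import Enum


class MyNewNames(str, Enum):
    """Stand-in for the question's StrEnum (assumption: values spelled out by hand)."""
    MyCrim = "MyCrim"
    MyZN = "MyZN"


def exact_type_names(labels):
    """Hypothetical helper: collect each label's exact type name, ignoring inheritance."""
    return sorted({type(label).__qualname__ for label in labels})


enum_labels = [MyNewNames.MyCrim, MyNewNames.MyZN]
plain_labels = [member.value for member in enum_labels]

print(exact_type_names(enum_labels))    # ['MyNewNames']
print(exact_type_names(plain_labels))   # ['str']

# An isinstance check would accept the enum members, since they subclass str.
print(all(isinstance(label, str) for label in enum_labels))  # True

# Lookup with an enum member works on plain-string keys, because the member
# hashes and compares like its string value.
columns = {"MyCrim": 0, "MyZN": 1}
print(columns[MyNewNames.MyCrim])       # 0
```

So a lookup with the enum member succeeds either way, but only labels stored as plain strings (for example via .value) pass an exact type check.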