
How to iterate over a DataFrame with a list passed as a function argument?


TV  newspaper  radio  Sales
87  13         23     123
89  09         34     169

Hi, all! First post, so I apologize for the formatting.

I created this function, and it works when I set features to a single column name, but I want to be able to pass in a list of multiple features (TV and radio, for example).

def coefficient_intercept(df,features,target):


  ybar = np.mean(df[target])
  xbar = np.mean(df[features])

  yi = df[target]
  xi = df[features]

  coefficient =  sum((yi - ybar) * (xi - xbar))  /  sum((xi - xbar)**2)
  intercept = ybar - coefficient * xbar

  index = [features]
  data = np.array([[coefficient],[intercept]])
  
  return pd.DataFrame(data = data.T, index=index,columns =['coefficient','intercept'])


I've tried a few different loops and a couple of iloc methods that didn't get me far. Mainly, I keep running into this error when I pass in ['TV','newspaper'], for example, and I want to understand where the string or 'str' part is coming from. Thanks for your time!

/usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3430: FutureWarning: In a future version, DataFrame.mean(axis=None) will return a scalar mean over the entire DataFrame. To retain the old behavior, use 'frame.mean(axis=0)' or just 'frame.mean()'
return mean(axis=axis, dtype=dtype, out=out, **kwargs)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 <ipython-input-31-25f7ee028e22> in <cell line: 1>()
1 coefficient_intercept(df,features = ['TV','newspaper'],target = 'sales')

 <ipython-input-28-5878e49e78b4> in coefficient_intercept(df, features, target)
15       xi = df[features]
16 
17       coefficient =  sum((yi - ybar) * (xi - xbar))  /  sum((xi - xbar)**2)
18       intercept = ybar - coefficient * xbar
19 

TypeError: unsupported operand type(s) for +: 'int' and 'str'

def coefficient_intercept(df,features,target):


  for column in df.columns:
    if column in features:

      ybar = np.mean(df[target])
      xbar = np.mean(df[features])

      yi = df[target]
      xi = df[features]

      coefficient =  sum((yi - ybar) * (xi - xbar))  /  sum((xi - xbar)**2)
      intercept = ybar - coefficient * xbar

      index = [features]
      data = np.array([[coefficient],[intercept]])
      
      
      print(pd.DataFrame(data = data.T, index=index,columns =['coefficient','intercept']))

Solution
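
As for where the 'str' comes from: the traceback points at the line that calls Python's built-in sum(), and iterating over a DataFrame yields its column labels rather than its values. When features is a list, xi is a DataFrame, so sum() ends up adding an integer to a column label such as 'TV'. A minimal sketch of that behaviour, assuming a DataFrame rebuilt from the two sample rows in the question (the names labels, error_message and column_totals are only for illustration):

```python
import pandas as pd

# Assumption: the two sample rows shown in the question
df = pd.DataFrame({
    "TV": [87, 89],
    "newspaper": [13, 9],
    "radio": [23, 34],
    "Sales": [123, 169],
})

xi = df[["TV", "newspaper"]]

# Iterating over a DataFrame yields its column labels, not its values
labels = list(xi)
print(labels)  # ['TV', 'newspaper']

# The built-in sum() starts from the integer 0 and adds each item it
# iterates over, so its first step here is 0 + 'TV'
error_message = None
try:
    sum(xi)
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # unsupported operand type(s) for +: 'int' and 'str'

# The DataFrame's own .sum() adds up the values in each column instead
column_totals = xi.sum()
print(column_totals)
```

Working on NumPy arrays, or calling the DataFrame's own .sum(), sidesteps this because the addition then runs over values rather than labels.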

  • If you want to generalize to more features, work with 2D arrays throughout, even when you have only one feature:

    def coefficient_intercept(df, features, target):
    
      # Wrap single column names in lists so that df[...] always returns 2D data
      features = features if isinstance(features, list) else [features]
      target = [target]
    
      # Get values as numpy array
      yi = df[target].values
      xi = df[features].values
    
      # Compute the mean
      ybar = np.mean(yi, axis=0)
      xbar = np.mean(xi, axis=0)
    
      # Simple linear regression, computed separately for each feature column
      coefficient =  sum((yi - ybar) * (xi - xbar))  /  sum((xi - xbar)**2)
      intercept = ybar - coefficient * xbar
    
      # Build your dataframe
      return pd.DataFrame({'coefficient': coefficient,
                           'intercept': intercept},
                          index=features)
    

    Output:

    >>> coefficient_intercept(df, 'radio', 'Sales')
           coefficient  intercept
    radio     4.181818  26.818182
    
    >>> coefficient_intercept(df, ['radio', 'TV'], 'Sales')
           coefficient    intercept
    radio     4.181818    26.818182
    TV       23.000000 -1878.000000
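
Note that the function above fits each feature on its own: every row of the result is a separate one-variable regression of the target on that column, not a single multiple regression on all features together. One way to check this is to compare each row against np.polyfit with deg=1. The sketch below assumes the two sample rows from the question and uses a condensed restatement of the function, with array .sum(axis=0) in place of the built-in sum; the names result and checks are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: the two sample rows shown in the question
df = pd.DataFrame({
    "TV": [87, 89],
    "newspaper": [13, 9],
    "radio": [23, 34],
    "Sales": [123, 169],
})

def coefficient_intercept(df, features, target):
    # Accept either a single column name or a list of names
    features = features if isinstance(features, list) else [features]
    yi = df[[target]].to_numpy()   # shape (n, 1)
    xi = df[features].to_numpy()   # shape (n, k)
    ybar = yi.mean(axis=0)
    xbar = xi.mean(axis=0)
    # Column-wise least-squares slope and intercept, one pair per feature
    coefficient = ((yi - ybar) * (xi - xbar)).sum(axis=0) / ((xi - xbar) ** 2).sum(axis=0)
    intercept = ybar - coefficient * xbar
    return pd.DataFrame({"coefficient": coefficient, "intercept": intercept},
                        index=features)

result = coefficient_intercept(df, ["radio", "TV"], "Sales")

# Compare every row with an independent degree-1 fit of Sales on that column
checks = {}
for feature in result.index:
    slope, intercept = np.polyfit(df[feature].to_numpy(), df["Sales"].to_numpy(), deg=1)
    checks[feature] = bool(np.allclose([slope, intercept], result.loc[feature].to_numpy()))

print(result)
print(checks)
```

If what you want is one model that uses all features at once, that is a different calculation (ordinary least squares on the full feature matrix) and the per-column formula here does not cover it.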