Search code examples
pythonpandasdataframeregressionstatsmodels

Regress each column individually with the last column of a dataframe in Pandas


returns = df.pct_change(1) * 100

returns is the dataframe with dates as index, where I want to regress each column individually with the last column (which is ^BSESN) of the dataframe. Since there are more than 700 columns, I want to iterate through the columns. I also want to store the residuals, after each regression. I used various versions of the following but I'm getting the same error constantly. I need to remove the rows which have NaN values during each individual regression (rather than removing rows with any column having nan value).

residuals = {}
for column in returns.columns[:-1]:
  selected = returns[['^BSESN', column]].dropna()
  reg = sm.OLS(selected[column], returns['^BSESN']).fit()
  residuals[column] = reg.residuals

But I keep getting the following error:

MissingDataError: exog contains inf or nans

How can I remove the error? Any guidance on the best way to do this would be much appreciated.


Solution

  • Change the OLS call to

    reg = sm.OLS(selected[column], selected['^BSESN']).fit()
    

    See that returns['^BSESN'] as in your original post didn't have any missing value dropped.

    As some columns do not have any rows matching with ^BSESN, you can add an if to ignore those columns:

    residuals = {}
    for column in returns.columns[:-1]:
      selected = returns[['^BSESN', column]].dropna()
      if selected.shape[0] > 0:
        reg = sm.OLS(selected[column], selected['^BSESN']).fit()
        residuals[column] = reg.residuals