Tags: machine-learning, classification, octave, logistic-regression

Cost function for logistic regression: weird/oscillating cost history


Background and my thought process:

I wanted to see if I could use logistic regression to create a hypothesis function that predicts recessions in the US economy from a date and its corresponding leading economic indicator (LEI). Leading economic indicators are widely regarded as good predictors of where the economy is heading.

To do this, I got data from the OECD on composite leading (economic) indicators from January 1970 to July 2021, along with the dates of US recessions from 1970 to 2021. The formatted data I use for training can be found further below.

Knowing the relationship between a recession and the date/LEI wouldn't be simply linear, I decided to add more parameters to each datapoint so I could fit a polynomial equation to the data. Thus, each datapoint has the following parameters: Date, LEI, LEI^2, LEI^3, LEI^4, and LEI^5.

The Problem:

When I attempt to train my hypothesis function, I get a very strange cost history that suggests either my cost function or my gradient descent was implemented incorrectly. Below is an image of my cost history: [image: weird, oscillating cost history]

I have tried implementing the suggestions from this post to fix my cost history, as originally I had the same NaN and Inf issues described there. While those suggestions fixed the NaN and Inf issues, I couldn't find anything to fix the cost once it started oscillating. Some of the other fixes I've tried are adjusting the learning rate, double-checking my cost and gradient-descent implementations, and introducing more parameters per datapoint (to see if a higher-degree polynomial equation would help).

My Code:

The main file is predictor.m.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Program: Predictor.m
% Author: Hasec Rainn
% Desc: Predictor.m uses logistic regression
% to predict when economic recessions will occur
% in the United States. The data it uses is from the past 50 years.
%
% In particular, it uses dates and their corresponding economic leading
% indicators to learn a non-linear hypothesis function to fit to the data.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


LI_Data = dlmread("leading_indicators_formatted.csv"); %Get LI data
RD_Data = dlmread("recession_dates_formatted.csv"); %Get RD data

%our datapoints of interest: Dates and their corresponding
%leading Indicator values.
%We are going to increase the number of parameters per datapoint to allow
%for a non-linear hypothesis function. Specifically, let the 3rd, 4th
%5th, and 6th columns represent LI^2, LI^3, LI^4, and LI^5 respectively

X = LI_Data;        %datapoints of interest (row = 1 datapoint)
X = [X, X(:,2).^2];  %Adding LI^2
X = [X, X(:,2).^3];  %Adding LI^3
X = [X, X(:,2).^4];  %Adding LI^4
X = [X, X(:,2).^5];  %Adding LI^5

%normalize data
X(:,1) = normalize( X(:,1) );
X(:,2) = normalize( X(:,2) );
X(:,3) = normalize( X(:,3) );
X(:,4) = normalize( X(:,4) );
X(:,5) = normalize( X(:,5) );
X(:,6) = normalize( X(:,6) );


%What we want to predict: if a recession happens or doesn't happen
%for a corresponding year
Y = RD_Data(:,2); %row = 1 datapoint

%defining a few useful variables:
nIter = 4000;       %how many iterations we want to run gradient descent for
ndp = size(X, 1);   %number of data points we have to work with
nPara = size(X,2);  %number of parameters per data point
alpha = 1;           %set the learning rate to 1

%Defining Theta
Theta = ones(1, nPara); %initialize the weights of Theta to 1

%Make a cost history so we can see if gradient descent is implemented
%correctly
costHist = zeros(nIter, 1);


for i = 1:nIter
  costHist(i, 1) = cost(Theta, Y, X);
  Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X));
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Cost
% Author: Hasec Rainn
% Parameters: Theta (vector), Y (vector), X (matrix)
% Desc: Uses Theta, Y, and X to determine the cost of our current
%       hypothesis function H_theta(X). Uses a manual loop approach to
%       avoid errors that arise from log(0).
%       Additionally, limits the range of H_Theta to prevent Inf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


function expense = cost(Theta, Y, X)
  m = size(X, 1); %number of data points
  hTheta = sigmoid(X*Theta'); %hypothesis function
  
  %limit the range of hTheta to [10^-50, 0.9999999999999]
  for i=1:size(hTheta, 1)
    if (hTheta(i) <= 10^(-50))
      hTheta(i) = 10^(-50);
    endif
    
    if (hTheta(i) >= 0.9999999999999)
      hTheta(i) = 0.9999999999999;
    endif
  endfor
  
  expense = 0;
  
  for i = 1:m
    
    if Y(i) == 1
      expense = expense + -log(hTheta(i));
    endif
    
    if Y(i) == 0
      expense = expense + -log(1-hTheta(i));
    endif
    
  endfor
  
endfunction


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: normalize
% Author: Hasec Rainn
% Parameters: vector
% Desc: Standardizes the input to zero mean and unit standard
%       deviation (z-score normalization)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function n = normalize(data)
   dMean = mean(data);
   dStd = std(data);
   n = (data - dMean) ./ dStd;
endfunction 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Sigmoid
% Author: Hasec Rainn
% Parameters: scalar, vector, or matrix
% Desc: Takes an input and forces its value(s) to be between
%       0 and 1. If a matrix or vector, sigmoid is applied to
%       each element.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function result = sigmoid(z)
  result = 1 ./ (1 + exp(-z));
endfunction

The data I used for my learning process can be found here: formatted LI data and recession dates data.


Solution

  • The problem you're running into here is your gradient descent update.

    In particular, while you correctly calculate the error term (that is, (hTheta - Y), or equivalently (sigmoid(X * Theta') - Y)), the update Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X)) does not apply the derivative of the cost correctly: it subtracts the raw summed gradient without the 1/m averaging, and it never uses the learning rate alpha you defined. With hundreds of datapoints, each step is therefore far too large, so Theta repeatedly overshoots the minimum and the cost oscillates.

    The derivative with respect to parameter j is the error of each datapoint (found in the vector hTheta - Y) multiplied by that datapoint's j-th parameter value, averaged over all m datapoints; the update should be theta_j := theta_j - (alpha/m) * sum_i (hTheta_i - y_i) * x_ij. For more information, check out this article; a sketch of the corrected loop follows below.
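
    For concreteness, here is a minimal sketch of a corrected update loop, assuming the same variable names (X, Y, Theta, alpha, nIter, costHist, sigmoid, cost) as in predictor.m above; this is one common vectorized form, not the asker's original code:

    %Sketch of a corrected batch-gradient-descent loop for logistic regression.
    m = size(X, 1);  %number of datapoints
    alpha = 0.1;     %a learning rate smaller than 1; tune as needed

    for i = 1:nIter
      costHist(i, 1) = cost(Theta, Y, X);
      err = sigmoid(X * Theta') - Y;  %m x 1 error vector (hTheta - Y)
      grad = (err' * X) ./ m;         %1 x nPara gradient: mean over i of err_i * x_ij
      Theta = Theta - alpha .* grad;  %scale the step by the learning rate
    end

    Note that (err' * X) is equivalent to sum((sigmoid(X * Theta') - Y) .* X) but makes the per-parameter multiplication explicit; the crucial additions are dividing by m and multiplying by alpha, which keep each step small enough for the cost to decrease smoothly instead of oscillating.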