machine-learning, nan, logistic-regression, missing-data, data-cleaning

Dealing with NaN (missing) values for Logistic Regression - best practices?


I am working with a dataset of patient information and trying to calculate the propensity score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.

I get errors due to these missing values when I perform logistic regression, as the values of my cost function and gradient vector become NaN. I am using the following MATLAB code (from Andrew Ng's Coursera Machine Learning class):

[m, n] = size(X);
X = [ones(m, 1) X];               % add intercept column
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);

% Minimize the cost function; the gradient is supplied by costFunction
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);

Note: sigmoid and costFunction are working functions I created for overall ease of use.
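
As a quick sanity check that the NaNs are the culprit, I run something like the following illustrative diagnostic before optimizing:

% Illustrative diagnostic: confirm the NaNs in X propagate into cost and grad
fprintf('NaNs in X: %d\n', nnz(isnan(X)));
[cost, grad] = costFunction(initial_theta, X, y);
fprintf('cost = %g, NaNs in grad: %d\n', cost, nnz(isnan(grad)));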

The calculations run smoothly if I replace all NaN values with 1 or 0. However, I am not sure that this is the best way to handle the issue, and I was also wondering what replacement value I should pick (in general) to get the best results when performing logistic regression with missing data. Are there any benefits or drawbacks to using a particular number (0, 1, or something else) to replace these missing values?
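
For reference, here is a minimal sketch of the two simplest replacements I have in mind (a constant fill versus a per-column mean fill); the variable names are illustrative:

XconstFill = X;
XconstFill(isnan(XconstFill)) = 0;          % constant fill with 0 (or 1)

XmeanFill = X;
colMeans = mean(XmeanFill, 1, 'omitnan');   % per-column means, ignoring NaNs
                                            % ('omitnan' needs R2015a+; nanmean is an alternative)
for j = 1:size(XmeanFill, 2)
    XmeanFill(isnan(XmeanFill(:, j)), j) = colMeans(j);
end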

Note: I have also normalized all feature values to lie in the range 0-1.

Any insight on this issue would be highly appreciated. Thank you.


Solution

  • As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".

    Forcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data the effect can be drastic; for example, setting all missing values to 1 in sparse binary data that contains more zeros than ones noticeably shifts each feature's distribution, as the toy example below illustrates.
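
    To make the drawback concrete, here is a small illustrative check (the toy vector is made up) of how a constant fill shifts the mean of a sparse binary column, while a mean fill preserves it:

    x = [0 0 0 1 0 NaN 0 1 NaN 0]';    % sparse binary column with missing entries
    mean(x, 'omitnan')                 % observed mean: 0.25

    xOnes = x;  xOnes(isnan(xOnes)) = 1;
    mean(xOnes)                        % jumps to 0.40: the fill dominates

    xMean = x;  xMean(isnan(xMean)) = mean(x, 'omitnan');
    mean(xMean)                        % stays at 0.25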

    Fortunately, MATLAB has a function called knnimpute (in the Bioinformatics Toolbox) that estimates each missing data point from its nearest neighbors.
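
    For example, a minimal call might look like the one below. Note that knnimpute searches for nearest-neighbor columns, so a samples-by-features matrix is transposed first; the orientation is worth double-checking against your version's documentation:

    % Impute each NaN from the 3 nearest samples (samples as columns)
    dataImputed = knnimpute(data.', 3).';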

    From my experience, I have often found knnimpute useful. However, it can fall short when there are too many missing sites, as in your data: the neighbors of a missing site may themselves be incomplete, leading to inaccurate estimates. Below is a workaround I put together; it begins by imputing the least incomplete columns and (optionally) imposes a safe predefined distance cutoff for the neighbors. I hope this helps.

    function data = dnnimpute(data,distCutoff,option,distMetric)
    % data = dnnimpute(data,distCutoff,option,distMetric)
    %
    %   Distance-based nearest neighbor imputation that imposes a distance
    %     cutoff to determine nearest neighbors, i.e., avoids those samples 
    %     that are more distant than the distCutoff argument.
    %
    %   Imputes missing data coded by "NaN" starting from the covariates 
    %     (columns) with the least number of missing data. Then it continues by 
    %     including more (complete) covariates in the calculation of pair-wise 
    %     distances.
    %
    %   option, 
    %       'mean'        - Unweighted average of the nearest neighboring values (default)
    %       'median'      - Median of the nearest neighboring values
    %       'weighted'    - Weighted average of the nearest neighboring values
    %
    %   distMetric,
    %       'euclidean'   - Euclidean distance (default)
    %       'seuclidean'  - Standardized Euclidean distance. Each coordinate
    %                       difference between rows in X is scaled by dividing
    %                       by the corresponding element of the standard
    %                       deviation S=NANSTD(X). To specify another value for
    %                       S, use D=pdist(X,'seuclidean',S).
    %       'cityblock'   - City Block distance
    %       'minkowski'   - Minkowski distance. The default exponent is 2. To
    %                       specify a different exponent, use
    %                       D = pdist(X,'minkowski',P), where the exponent P is
    %                       a scalar positive value.
    %       'chebychev'   - Chebychev distance (maximum coordinate difference)
    %       'mahalanobis' - Mahalanobis distance, using the sample covariance
    %                       of X as computed by NANCOV. To compute the distance
    %                       with a different covariance, use
    %                       D =  pdist(X,'mahalanobis',C), where the matrix C
    %                       is symmetric and positive definite.
    %       'cosine'      - One minus the cosine of the included angle
    %                       between observations (treated as vectors)
    %       'correlation' - One minus the sample linear correlation between
    %                       observations (treated as sequences of values).
    %       'spearman'    - One minus the sample Spearman's rank correlation
    %                       between observations (treated as sequences of values).
    %       'hamming'     - Hamming distance, percentage of coordinates
    %                       that differ
    %       'jaccard'     - One minus the Jaccard coefficient, the
    %                       percentage of nonzero coordinates that differ
    %       function      - A distance function specified using @, for
    %                       example @DISTFUN.
    %  
    if nargin < 3
        option = 'mean';
    end
    if nargin < 4
        distMetric = 'euclidean';
    end
    
    nanVals = isnan(data);
    nanValsPerCov = sum(nanVals,1);
    noNansCov = nanValsPerCov == 0;
    % If no covariate is complete, complete the one with the fewest NaNs
    % by filling its missing entries with the corresponding row means.
    if isempty(find(noNansCov, 1))
        [~,leastNans] = min(nanValsPerCov);
        noNansCov(leastNans) = true;
        first = data(nanVals(:,noNansCov),:);
        nanRows = find(nanVals(:,noNansCov)==true); i = 1;
        for row = first'
            data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
            i = i+1;
        end
    end
    
    nSamples = size(data,1);
    % Default distance cutoff: the smallest pairwise distance between
    % samples, computed over the complete covariates only.
    if nargin < 2
        dataNoNans = data(:,noNansCov);
        distances = pdist(dataNoNans);
        distCutoff = min(distances);
    end
    [stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
    imputeCols = idxCovMissDat(stdCovMissDat>0);
    % Impute starting from the cols (covariates) with the least number of 
    % missing data. 
    for c = reshape(imputeCols,1,length(imputeCols))    
        imputeRows = 1:nSamples;
        imputeRows = imputeRows(nanVals(:,c));   
        for r = reshape(imputeRows,1,length(imputeRows))
            % Calculate distances to candidate donor rows (rows where c is observed)
            distR = inf(nSamples,1);
            % Use only covariates that are observed at row r and that are also
            % complete across the candidate donor rows.
            noNansCov_r = find(isnan(data(r,:))==0);
            noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
            for i = find(nanVals(:,c)'==false)
                distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);            
            end
            tmp = min(distR(distR>0));   % nearest positive distance, used as a fallback cutoff
            % Impute the missing data at sample r of covariate c
            switch option
                case 'weighted'
                    data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
                case 'median'
                    data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
                case 'mean'
                    data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
            end
            % The missing value at (r,c) is now imputed; mark it as observed
            % so later iterations can use it.
            nanVals(r,c) = false;  
        end    
        fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
    end
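
    A typical call might look like the following; the cutoff value of 0.5 is only illustrative, and the option strings follow the help text above:

    % Defaults: automatic distance cutoff, unweighted mean, Euclidean distance
    dataImputed = dnnimpute(data);

    % Weighted average of neighbors within the cutoff, using standardized
    % Euclidean distance
    dataImputed = dnnimpute(data, 0.5, 'weighted', 'seuclidean');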