Search code examples
matlabloopsfor-looprankingpercentile

How to use the Percentile Function for Ranking purpose? (Matlab)


I have the following 606 x 274 table: see here


Goal:

For every date calculate lower and upper 20% percentiles and, based on the outcome, create 2 new variables, e.g. 'L' for "lower" and 'U' for "upper", which contain the ticker names as seen in the header of the table.

Step by step:

% Replace NaNs with 'empty' for the percentile calculation (error: input to be cell array)
     T(cellfun(@isnan,T)) = {[]}
% Change date format
     T.Date=[datetime(T.Date, 'InputFormat', 'eee dd-MMM-yyyy')];
% Take row by row 
     for row=1:606
% If Value is in upper 20% percentile create new variable 'U' that contains the according ticker names.
% If Value is in lower 20% percentile create new variable 'L' that contains the according ticker names.
     end;

So far, experimenting with 'prctile' only yielded a numeric outcome, for a single column. Example:

Y = prctile(T.A2AIM,20,2);

Thanks for your help and ideas!


Solution

  • Generally speaking, if you have an array of numbers:

    a = [4 2 1 8 -2];
    

    percentiles can be computed by first sorting the array and then attempting to access the index supplied in the percentile. So prctile(a,20)'s functionality could in principle be replaced by

    b = sort(a);
    ind = round(length(b)*20/100);
    if ind==0
        ind = 1;
    end
    b = b(ind);
    % b = -2
    

    However, prctile does a bit more of fancy magic interpolating the input vector to get a value that is less affected by array size. However, you can use the idea above to find the percentile splitting columns. If you chose to do it like I said above, what you want to do to get the headers that correspond to the 20% and 80% percentiles is to loop through the rows, remove the NaNs, get the indeces of the sort on the remaining values and get the particular index of the 20% or 80% percentile. Regrettably, I have an old version of Matlab that does not support tables so I couldn't verify if the header names are returned correctly, but the idea should be clear.

    L = cell(size(T,1),1);
    U = cell(size(T,1),1);
    for row=1:size(T,1)
        row_values = T{row,:};
        row_values = row_values(2:end); % Remove date column
        non_nan_indeces = find(~isnan(row_values));
        if not(isempty(non_nan_indeces))
            [row_values,sorted_indeces] = sort(row_values(non_nan_indeces));
            % The +1 is because we removed the date column
            L_ind = non_nan_indeces(sorted_indeces(1:round(0.2*length(row_values))))+1;
            U_ind = non_nan_indeces(sorted_indeces(round(0.8*length(row_values)):end))+1;
            % I am unsure about this part
            L{row} = T.Properties.VariableNames(L_ind);
            U{row} = T.Properties.VariableNames(U_ind);
        else
            L{row} = nan;
            U{row} = nan;
        end
    end;
    

    If you want to use matlab's prctile, you would have to find the returned value's index doing something like this:

    L = cell(size(T,1),1);
    U = cell(size(T,1),1);
    for row=1:size(T,1)
        row_values = T{row,:};
        row_values = row_values(2:end); % Remove date column
        non_nan_indeces = find(~isnan(row_values));
        if not(isempty(non_nan_indeces))
            [row_values,sorted_indeces] = sort(row_values(non_nan_indeces));
            L_val = prctile(row_values(non_nan_indeces),20);
            U_val = prctile(row_values(non_nan_indeces),80);
            % The +1 is because we removed the date column
            L_ind = non_nan_indeces(sorted_indeces(find(row_values<=L_val)))+1;
            U_ind = non_nan_indeces(sorted_indeces(find(row_values>=U_val)))+1;
            % I am unsure about this part
            L{row} = T.Properties.VariableNames(L_ind);
            U{row} = T.Properties.VariableNames(U_ind);
        else
            L{row} = nan;
            U{row} = nan;
        end
    end;