I have the following 606 x 274 table: see here
Goal:
For every date calculate lower and upper 20% percentiles and, based on the outcome, create 2 new variables, e.g. 'L' for "lower" and 'U' for "upper", which contain the ticker names as seen in the header of the table.
Step by step:
% Replace NaNs with 'empty' for the percentile calculation (error: input to be cell array)
T(cellfun(@isnan,T)) = {[]}
% Change date format
T.Date=[datetime(T.Date, 'InputFormat', 'eee dd-MMM-yyyy')];
% Take row by row
for row=1:606
% If Value is in upper 20% percentile create new variable 'U' that contains the according ticker names.
% If Value is in lower 20% percentile create new variable 'L' that contains the according ticker names.
end;
So far, experimenting with 'prctile' only yielded a numeric outcome, for a single column. Example:
Y = prctile(T.A2AIM,20,2);
Thanks for your help and ideas!
Generally speaking, if you have an array of numbers:
a = [4 2 1 8 -2];
percentiles can be computed by first sorting the array and then attempting to access the index supplied in the percentile. So prctile(a,20)
's functionality could in principle be replaced by
b = sort(a);
ind = round(length(b)*20/100);
if ind==0
ind = 1;
end
b = b(ind);
% b = -2
However, prctile
does a bit more of fancy magic interpolating the input vector to get a value that is less affected by array size. However, you can use the idea above to find the percentile splitting columns. If you chose to do it like I said above, what you want to do to get the headers that correspond to the 20% and 80% percentiles is to loop through the rows, remove the NaNs, get the indeces of the sort on the remaining values and get the particular index of the 20% or 80% percentile. Regrettably, I have an old version of Matlab that does not support tables so I couldn't verify if the header names are returned correctly, but the idea should be clear.
L = cell(size(T,1),1);
U = cell(size(T,1),1);
for row=1:size(T,1)
row_values = T{row,:};
row_values = row_values(2:end); % Remove date column
non_nan_indeces = find(~isnan(row_values));
if not(isempty(non_nan_indeces))
[row_values,sorted_indeces] = sort(row_values(non_nan_indeces));
% The +1 is because we removed the date column
L_ind = non_nan_indeces(sorted_indeces(1:round(0.2*length(row_values))))+1;
U_ind = non_nan_indeces(sorted_indeces(round(0.8*length(row_values)):end))+1;
% I am unsure about this part
L{row} = T.Properties.VariableNames(L_ind);
U{row} = T.Properties.VariableNames(U_ind);
else
L{row} = nan;
U{row} = nan;
end
end;
If you want to use matlab's prctile
, you would have to find the returned value's index doing something like this:
L = cell(size(T,1),1);
U = cell(size(T,1),1);
for row=1:size(T,1)
row_values = T{row,:};
row_values = row_values(2:end); % Remove date column
non_nan_indeces = find(~isnan(row_values));
if not(isempty(non_nan_indeces))
[row_values,sorted_indeces] = sort(row_values(non_nan_indeces));
L_val = prctile(row_values(non_nan_indeces),20);
U_val = prctile(row_values(non_nan_indeces),80);
% The +1 is because we removed the date column
L_ind = non_nan_indeces(sorted_indeces(find(row_values<=L_val)))+1;
U_ind = non_nan_indeces(sorted_indeces(find(row_values>=U_val)))+1;
% I am unsure about this part
L{row} = T.Properties.VariableNames(L_ind);
U{row} = T.Properties.VariableNames(U_ind);
else
L{row} = nan;
U{row} = nan;
end
end;