Search code examples
matlablarge-data

Assign labels based on given examples for a large dataset effectively


I have matrix X (100000 X 10) and vector Y (100000 X 1). X rows are categorical and assume values 1 to 5, and labels are categorical too (11 to 20);

The rows of X are repetitive and there are only ~25% of unique rows, I want Y to have statistical mode of all the labels for a particular unique row.

And then there comes another dataset P (90000 X 10), I want to predict labels Q based on the previous exercise.

What I tried is finding unique rows of X using unique in MATLAB, and then assign statistical mode of each of these labels for the unique rows. For P, I can use ismember and carry out the same.

The issue is in the size of the dataset and it takes an 1.5-2 hours to complete the process. Is there a vectorize version possible in MATLAB?

Here is my code:

[X_unique,~,ic] = unique(X,'rows','stable');
labels=zeros(length(X_unique),1);
for i=1:length(X_unique)
    labels(i)=mode(Y(ic==i));
end

Q=zeros(length(P),1);
for j=1:length(X_unique)
    Q(all(repmat(X_unique(j,:),length(P),1)==P,2))=label(j);
end

Solution

  • You will be able to accelerate your first loop a great deal if you replace it entirely with:

    labels = accumarray(ic, Y, [], @(y) mode(y));
    

    The second loop can be accelerated by using all(bsxfun(@eq, X_unique(i,:), P), 2) inside Q(...). This is a good vectorized approach assuming your arrays are not extremely large w.r.t. the available memory on your machine. In addition, to save more time, you could use the unique trick you did with X on P, run all the comparisons on a much smaller array:

    [P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
    

    EDIT: to compute Q_unique in the following way: and then convert it back to the full array using:

    Q_unique = zeros(length(P_unique),1);
    for i = 1:length(X_unique)
        Q_unique(all(bsxfun(@eq, X_unique(i,:), P_unique), 2)) = labels(i)
    end
    

    and convert back to Q_full to match the original P input:

    Q_full = Q_unique(IC_P);
    

    END EDIT

    Finally, if memory is an issue, in addition to everything above, you might want you use a semi-vectorized approach inside your second loop:

    for i = 1:length(X_unique)
        idx = true(length(P), 1);
        for j = 1:size(X_unique,2)
            idx = idx & (X_unique(i,j) == P(:,j));
        end
        Q(idx) = labels(i);
    %    Q(all(bsxfun(@eq, X_unique(i,:), P), 2)) = labels(i);
    end
    

    This would take about x3 longer compared with bsxfun but if memory is limited then you gotta pay with speed.

    ANOTHER EDIT

    Depending on your version of Matlab, you could also use containers.Map to your advantage by mapping textual representations of the numeric sequences to the calculated labels. See example below.

    % find unique members of X to work with a smaller array
    [X_unique, ~, IC_X] = unique(X, 'rows', 'stable');
    % compute labels
    labels = accumarray(IC_X, Y, [], @(y) mode(y)); 
    % convert X to cellstr -- textual representation of the number sequence
    X_cellstr = cellstr(char(X_unique+48)); % 48 is ASCII for 0
    % map each X to its label
    X_map = containers.Map(X_cellstr, labels);
    % find unique members of P to work with a smaller array
    [P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
    % convert P to cellstr -- textual representation of the number sequence
    P_cellstr = cellstr(char(P_unique+48)); % 48 is ASCII for 0
    % --- EDIT --- avoiding error on missing keys in X_map --------------------
    % find which P's exist in map
    isInMapP = X_map.isKey(P_cellstr);
    % pre-allocate Q_unique to the size of P_unique (can be any value you want)
    Q_unique = nan(size(P_cellstr)); % NaN is safe to use since not a label
    % find the labels for each P_unique that exists in X_map
    Q_unique(isInMapP) = cell2mat(X_map.values(P_cellstr(isInMapP)));
    % --- END EDIT ------------------------------------------------------------
    % convert back to full Q array to match original P
    Q_full = Q_unique(IC_P);
    

    This takes about 15 seconds to run on my laptop. Most of which is consumed by computation of mode.