Search code examples
stringmatlabcountgroup-bycell-array

Change code to fit strings - Matlab


I have the following code:

NI1=[NI{:,1} NI{:,2} NI{:,3}];
[~,NI2]=sort(NI1(:,2));
NI1=NI1(NI2,:);
NI1((NI1(:,3) == 0),:) = [];
NI1=unique(NI1(:,1:3),'rows');
NI3= unique(NI1(:,1:2),'rows')

for mj=1:size(NI3,1)
    NI3(mj,3)=sum(NI1(:,1) == NI3(mj,1) & NI1(:,2)==NI3(mj,2));
end

My initial cell-array NI1 has in collumns: 1) the year; 2) a code that corresponds to a bank 3) a code that corresponds to the workers of the bank. EXAMPLE:

c1      c2  c3
1997    3   850
1997    3   1024
1997    3   5792

My output NI3 counts how many analysts (c3), for the different years (c1) are working in each bank (c2), for instance:

c1      c2  c3
1997    3   14
1997    7   84
1997    11  15
1998    4   1
1998    15  10
1998    3   12
1999    11  17

Now I am trying to apply exactly the same code, but my last column (c3) is a string so initial cell array fir_ins is the following:

1997    3   'ACAD'
1997    3   'ADCT'
1997    3   'ADEX'

I want to obtain exactly the same output as in NI3, but I have to change the code, since my last column is a string.

I am only missing the last part, this is the code I have so far.

ESTIMA=num2cell(I{:,6});
ANALY=num2cell(I{:,7});
YEAR = num2cell(T_ANNDAT3);
fir_ins=[YEAR ESTIMA I{:,1}];
fir_ins= sortrows(fir_ins,2);
[~, in2,~] = unique(strcat(fir_ins(:,2),fir_ins(:, 3)));
fir_ins = fir_ins(in2,:);
fir_ins= sortrows(fir_ins,[1 2]);
fir_ins2=fir_ins(:,1:2);
fir_ins2=unique(cell2mat(fir_ins2(:,1:2)),'rows');

This part is not working:

for jm=1:size(fir_ins2,1)
    fir_ins2(jm,3)=sum(cell2mat(fir_ins(:,1))) == fir_ins2(jm,1) & cell2mat(fir_ins(:,2))==cell2mat(fir_ins2(jm,2));
end

Solution

  • You can perform this "aggregation" more efficiently with the help of accumarray function. The idea is to map the first two columns (row primary keys) into subscripts (indices starting from 1), then pass those subscripts to accumarray to do the counting.

    Below is an example to illustrate. First I start by generating some random data resembling yours:

    % here are the columns
    n = 150;
    c1 = sort(randi([1997 1999], [n 1]));           % years
    c2 = sort(randi([3 11], [n 1]));                % bank code
    c3 = randi(5000, [n 1]);                        % employee ID as a number
    c4 = cellstr(char(randi(['A' 'Z']-0, [n,4])));  % employee ID as a string
    
    % combine records (NI)
    X = [c1 c2 c3];                                 % the one with numeric worker ID
    X2 = [num2cell([c1 c2]) c4]; % {c1 c3 c4}       % the one with string worker ID
    

    Note that for our purposes, it doesn't matter if the workers ID column is expressed as numbers or string; we won't be using them, only the first two columns that represent the "primary keys" of the rows are used:

    % find the unique primary keys and their subscript mapping
    [years_banks,~,ind] = unique([c1 c2], 'rows');
    
    % count occurences (as in SQL: SELECT COUNT(..) FROM .. GROUPT BY ..)
    counts = accumarray(ind, 1);
    
    % build final matrix: years, bank codes, counts
    M = [years_banks counts];
    

    I got the following result with my fake data:

    >> M
    M =
            1997           3          13
            1997           4          11
            1997           5          15
            1997           6          14
            1997           7           4
            1998           7          11
            1998           8          24
            1998           9          15
            1999           9           1
            1999          10          22
            1999          11          20