Search code examples
stringmatlabstrcmp

sumproduct on condition that text strings match


and thank you for showing interest in my post :)

Let me quickly describe the data I'm working with:

I'm important data from excel into my matlab script so that I end up with 3 vectors. These are:

  • "FIT_txt" (~250k rows containing mixed letters & numbers in each cell. repeated entries)
  • "FIT_num" (same number of rows as "FIT_txt" containing numbers only)
  • "sector" (~5k rows containing mixed letters & numbers in each cell. unique entries)

Now, what I want to achieve:

  1. create a vector that counts the number of times each row in "sector" is appearing in "FIT_txt". E.g.: say the first entry in "sector" is "AB10", then I want to create a vector that counts the number of times "AB10" occurs in "FIT_txt"; this should be saved in the first row of the new vector. The second row of the new vector counts the appearances of row 2 from "sector" (say "AB11") in "FIT_txt", etc.

  2. create a vector that adds all the numbers in "FIT_num" that occur in the same row as the entries in "FIT_txt" that correspond to a "sector" entry, and sort it accordingly. E.g.: "AB10" in row 1 of "sector" occurs 3 times in "FIT_txt" - it occurs in row 2, 500 & 2000. I want to add up rows 2, 500 & 2000 in "FIT_num" and put them into the first row of the new vector.

Now, this is a simplification of my problem; essentially, I am using a lot more data and I'm repeating this process to find a much larger number of new vectors/matrices. With regard to the simplification, I have managed to solve the problem by doing this:

units = zeros(length(sector),1);
installed = zeros(length(sector),1);    

for a = 1:length(sector)
        for z = 1:length(FIT_txt(:,1))

            if strcmp((FIT_txt(z,1)),(sector(a)))==1

                units(a,1) = units(a,1) + 1;
                installed(a,1) = installed(a,1) + FIT_num(z,1);

            end

        end
    end

Unfortunately, I fear this is very inefficient and takes way too long to calculate.

I have managed to achieve decent results for my first question (counting the appearances) using this:

units = zeros(length(sector),1);

for a = 1:length(sector)
    units(a,1) = sum(strcmp((FIT_txt(:,1)),(sector(a)))
end

That works fine (although still taking a bit longer than I would like), however, I don't know how to solve my second question (counting the values from "FIT_num" that correspond).

I would appreciate if you could assist me in finding a solution that is as efficient as possible in solving my problem.

Thank you very much in advance.

John


Solution

  • Let's solve this step by step: I'm assuming FIT_txt is a Nx1 cell array and sector is a Mx1 cell array.

    First find the rows in FIT_txt where sector(ii) appears in, and also how many times:

    ii=1;
    sector_occurrence = strfind(FIT_txt,sector(ii));
    

    This gives you N cells for each row of FIT_txt with in each cell the actual location at each row where sector(ii) is found. You just need the count per row, so use numel to get just that:

    cellfun(@numel,sector_occurrence );
    

    You can do this for every element of sector with a loop or with arrayfun:

    sector_occurrence = cell2mat(arrayfun(@(ii) cellfun(@numel,strfind(FIT_txt,sector{ii})),1:numel(sector),'uni',false'));
    

    Now you have a NxM matrix. The element at row i and column j tells how many times sector(j) occurs in FIT_txt(i).

    *reading your comments... Aah, but if a sector is found, FIT_txt is exactly equal to the sector, which allows you the simplify all the above this with strcmp and arrayfun:

    sector_occurrence = cell2mat(arrayfun(@(sectorii) strcmp(FIT_txt,sectorii), sector', 'uni',false))
    



    Now onto the summation:

    It's just a vector product of the occurrence vector with FIT_num, and you can get them all in one call by multiplying the sector_occurrence matrix with FIT_num:

    sumresult = sector_occurrence'*FIT_num;
    

    note the transpose operator ' because sector_occurrence was defined as NxM.

    Example

    >> sector=[{'AB10'} ; {'b'}];
    >> FIT_txt=[{'AB10'} ; {'a'} ; {'b'} ; {'ZX5b'} ; {'AB10'} ; {'b'}];
    >> FIT_num = (1:6)';
    
    >> sector_occurrence = strfind(FIT_txt,sector(ii))
    
    sector_occurrence =
    
         1     0
         0     0
         0     1
         0     0
         1     0
         0     1
    
    >> sumresult = sector_occurrence'*FIT_num
    
    sumresult =
    
         6
         9
    

    Of course, it has to be an exact match (which you said it was) b won't be equal to B and also not to bb, only b.

    Remark on large data set

    sector_occurrence is a logical array, so it doesn't need that much memory. But when executing the multiplication with FIT_num, it gets converted to float, which needs 8 times as much space. The end result is a small (in memory) vector, but the intermediary process can melt your pc. You can avoid this by doing the multiplication in a loop:

    sumresult=NaN(numel(sector),1); %preallocation is a good thing
    for ii=1:numel(sector)
        sumresult(ii)=sector_occurrence(:,ii)'*FIT_num;
    end