Search code examples
matlabsizematchcell

Match cell arrays with different size based on two conditions in Matlab


RECCELL is a cell array with 8 columns and 30000 rows:

 C1     C2      C3          C4   C5          C6     C7    C8
'AA'    1997    19970102    1   'BACHE'     'MORI'  148   127
'AA'    1997    19970108    2   'MORGAN'     []     1595  0
'AA'    1997    19970224    3   'KEMSEC'    'FATHI' 1315  297

CONCELL is a cell array with 4 columns and 70000 rows:

 C1     C2      D3          D4
'AA'    1997    19970116    2,75
'AA'    1997    19970220    2,71
'AA'    1997    19970320    2,61

I would like to add to RECCELL the 4 columns of CONCELL only in case the C1s match and C3 and D3 (both dates) are the closest possible. For instance I would get in this example:

 C1     C2      C3         C4    C5           C6     C7    C8   C1      C2      D3          D4           

'AA'    1997    19970102    1   'BACHE'      'MORI'  148   127  'AA'    1997    19970116    2,75

'AA'    1997    19970108    2   'MORGAN'     []      1595  0    'AA'    1997    19970116    2,75

'AA'    1997    19970113    3   'KEMSEC'    'FATHI'  1315  297  'AA'    1997    19970220    2,71
  • To the first row of RECCELL corresponds the first row of CONCELL.
  • To the second row of RECCELL corresponds the first row of CONCELL.
  • To the third row of RECCELL corresponds the second row of CONCELL.

The code I have so far is:

[~, indCon, indREC] = intersect(CONCELL(:,1), RECCELL(:,1));
REC_CON=[RECCELL(indREC,:),CONCELL(indCon,:)];
NO_REC_CON= RECCELL(setdiff(1:size(RECCELL,1), indREC),:);

It's wrong because I cannot use intersect for a string element and because I am not considering the second condition, which is to choose the closest dates.

Can someone help me? Thank you


Solution

  • I would suggest to do this inside a for loop as the cells are very tall. (Note: it seems like the date format (C3/D3) in the cell is a double opposed to a string, thus needs to be converted first for using datenum)

    n=size(RECCELL,1);
    ind=zeros(n,1);
    rd=datenum(num2str(cell2mat(CONCELL(:,3))),'yyyymmdd'); % convert double to string
    for k=1:n
        a=find(ismember(CONCELL(:,1),RECCELL(k,1))==1);     % find indices of matching C1s
        if ~isempty(a)                                      % do only if there is a match for the C1s
            dnk=datenum(num2str(RECCELL{k,3}),'yyyymmdd');  % convert double to string
            [~,f]=min((rd(a)-dnk).^2);                      % find closest date of the subset a
            ind(k,1)=a(f);                                  % assign index of closest match to ind
            RECCELL(k,(end+1):(end+4))=CONCELL(ind(k,1),:); % add CONCELL to RECCELL, be aware that other rows will now display empty cells, and a row of RECCELL can keep 'growing'
        end
    end
    

    The vector ind contains the indices of the closest match in CONCELL for each entry in RECCELL. When it contains a 0, no match was found between the C1s.

    Edit: One possible solution to avoid increasing the number of columns of RECCELL if multiple CONCELL entries are added to the same RECCELL entry is the following which results in a adding a single column to the RECCELL matrix:

    n=size(RECCELL,1);
    RECCELL{1,end+1}=[];                                    % to add a single empty column to RECCELL
    ind=zeros(n,1);
    rd=datenum(num2str(cell2mat(CONCELL(:,3))),'yyyymmdd'); % convert double to string
    for k=1:n
        a=find(ismember(CONCELL(:,1),RECCELL(k,1))==1);     % find indices of matching C1s
        if ~isempty(a)                                      % do only if there is a match for the C1s
            dnk=datenum(num2str(RECCELL{k,3}),'yyyymmdd');  % convert double to string
            [~,f]=min((rd(a)-dnk).^2);                      % find closest date of the subset a
            ind(k,1)=a(f);                                  % assign index of closest match to ind
            if isempty(RECCELL{k,end})                      % if nothing is in this cell, add the CONCELL entry to it
                RECCELL{k,end}=CONCELL(ind(k,1),:);
            else                                            % if something is already in, add the new CONCELL entry to the cell
                RECCELL{k,end}(end+1,1:4)=CONCELL(ind(k,1),:);
            end
        end
    end