Search code examples
matlabpivot-tablecell-array

Create new variable based on existing columns of a cell in Matlab


I have a cell-array with 600 000 rows and 5 columns. In the following example I only present 3 different codes and a period of 5 years. Input:

c1   c2        c3        c4  c5 

1   2006    20060425    559 'IA'
1   2007    20070129    559 'LO'
1   2007    20070826    559 'VC'
1   2008    20080825     34 'VP'
1   2009    20090116     34 'ZO'
4   2007    20070725     42 'OI'
4   2008    20080712     42 'TF'
4   2008    20080428     42 'XU'
11  2007    20070730    118 'AM'
11  2008    20080912    118 'HK'
11  2009    20090318      2 'VT'
11  2010    20100121      2 'ZZ'

I would like to obtain a new variable that gives for each code (c1) the years in which c1 appears in the sample and the corresponding c4 value. For instance:

Output:

x  2006 2007 2008 2009 2010
1  559  559  34   34   - 
4   -   42   42   -    -
11  -   118  118  2    2 

To get to my cell-array, this is the code I used so far:

a1=T_ANNDAT3;
a2=I{:,7};
a3=I{:,6};
a4=I{:,16};
a5=I{:,1};
TRACK_AN = [num2cell([a2 a1 a4 a3]) a5];
TRACK_AN(cell2mat(TRACK_AN(:,1))==0,:)=[];
[~,indTA,~] = unique(strcat(TRACK_AN(:,1),TRACK_AN(:,2),TRACK_AN(:,4),TRACK_AN(:,5)));
TRACK_AN = TRACK_AN(indTA,:);

Can someone help?


Solution

  • You can calculate this very easily using unique as what you have seen. The key is to use the 'rows' flag as the second parameter into unique so you can figure out the unique row entries for the matrix. We only need the first, second and fourth columns of the matrix for this process so we can just subset those columns out. You also need to use the additional output parameters of unique so we can figure out where exactly the unique rows appear in the original cell array. This is the key property we need for the next part of the algorithm.

    After you find the unique cell array from the first unique call, we apply unique two more times - One for the column of c1 and one more for the column of c2 so we can index the ID and the year. We will use the third output parameter of unique so that we can assign each unique number within each column into a unique ID. We then use accumarray to create the final matrix that you see above, binning the values in the fourth column given the first column serving as rows and the second column serving as columns for this final matrix. In other words:

    %// Create cell array as per your example
    C = {1   2006    20060425    559 'IA'
    1   2007    20070129    559 'LO'
    1   2007    20070826    559 'VC'
    1   2008    20080825     34 'VP'
    1   2009    20090116     34 'ZO'
    4   2007    20070725     42 'OI'
    4   2008    20080712     42 'TF'
    4   2008    20080428     42 'XU'
    11  2007    20070730    118 'AM'
    11  2008    20080912    118 'HK'
    11  2009    20090318      2 'VT'
    11  2010    20100121      2 'ZZ'};
    
    %// Get only those columns that are relevant
    %// These are the first, second and fourth columns
    Cmat = unique(cell2mat(C(:,[1 2 4])), 'rows');
    
    %// Bin each of the first and second columns
    %// Give them a unique ID per unique number    
    [~,~,ind] = unique(Cmat(:,1));
    [~,~,ind2] = unique(Cmat(:,2));
    
    %// Use accumarray to create your matrix    
    %// Edit - Thanks to Amro
    %// Any values that are missing replace with NaN
    finalMat = accumarray([ind ind2], Cmat(:,3), [], [], NaN);
    

    The output is thus:

    finalMat =
    
    559   559    34    34   NaN
    NaN    42    42   NaN   NaN
    NaN   118   118     2     2
    

    I replaced those values that were missing with NaN to signify the missing values.

    Hope this helps!