Search code examples
matlabpaddingmissing-data

Adding 0 for missing data rather than excluding the category in matlab


I have the two following tables of data, one named data1, the other named data2. The left-hand column is a categorical variable and the right hand column is frequency I would like to rewrite these tables but where there are missing categories in the left-hand column I would like it to put in the correct missing category and then put a '0' in the right-hand frequency column.

data1 = [
    
1    170
2    120
3    100
4     40
5     30
6     20
7     10
9     8
10    2
11    1
14    1
];


data2 = [
    
1    240
2    200
3    180
4     60
5     50
6     40
7     30
8     20
9     8
10    2
12    1
19    1
];

To be clearer I will explain with an example. In data1, 8 12 and 13 are missing in the left-hand column. I would like matlab to recreate this table but with 0 values for 8, 12 and 13 so it looks as follows. I would also like it to have additional empty categories after '14' because data2 is longer and has more categories. I have also included what data2 should look like with filled in values.

data1 = [
    
1    170
2    120
3    100
4     40
5     30
6     20
7     10
8     0
9     8
10    2
11    1
12    0
13    0
14    1
15    0
16    0
17    0
18    0
19    0
];


data2 = [
    
1    240
2    200
3    180
4     60
5     50
6     40
7     30
8     20
9     8
10    2
11    0
12    1
13    0
14    0
15    0
16    0
17    0
18    0
19    1
];

I have a handful of datasets which generally all start with 1,2,3,4,5...etc but then they all have slightly different categories on the left-hand column, because where values are missing it just omits the category rather than putting 0. How do i write a code so that it automatically fills in any blanks with a 0. It would be good if the code could identify what the 'highest' number of categories is amongst all the datasets and then fill in blanks based on this.

my aim is to put together a grouped bar chart with data series that are all the same length.

UPDATED OUTPUT WITH 3 DATASETS

this is what your AllJoins code outputs in my matlab:

     A     table1    table2    table3
     __    ______    ______    ______

     1     170       240       2400 
     2     120       200       2000 
     3     100       180          0 
     4      40        60          0 
     5      30        50          0 
     6      20        40          0 
     7      10        30          0 
     8       0        20          0 
     9       8         8          0 
    10       2         2          0 
    11       1         0          0 
    12       0         1          0 
    14       1         0          0 
    19       0         1          0 
    20       0         0       1800 

I would like the code to fill in the missing consecutive numbers in column A so that it looks as follows:

A     table1    table2    table3
__    ______    ______    ______

 1     170       240       2400 
 2     120       200       2000 
 3     100       180          0 
 4      40        60          0 
 5      30        50          0 
 6      20        40          0 
 7      10        30          0 
 8       0        20          0 
 9       8         8          0 
10       2         2          0 
11       1         0          0 
12       0         1          0 
13       0         0          0
14       1         0          0
15       0         0          0
16       0         0          0 
17       0         0          0
18       0         0          0
19       0         1          0 
20       0         0       1800 

Solution

  • You can convert the datasets to a table and then use outerjoin. Then you can replace the NaNs with whatever you want using fillmissing.

    table1 = array2table(data1);
    table1.Properties.VariableNames = {'A', 'B'};
    table2 = array2table(data2);
    table2.Properties.VariableNames = {'A', 'B'};
    
    newTable = outerjoin(table1, table2, 'LeftKeys', {'A'}, 'RightKeys', {'A'}, 'MergeKeys', true)
    

    which produces:

    A     B_table1    B_table2
    __    ________    ________
    
     1      170         240   
     2      120         200   
     3      100         180   
     4       40          60   
     5       30          50   
     6       20          40   
     7       10          30   
     8      NaN          20   
     9        8           8   
    10        2           2   
    11        1         NaN   
    12      NaN           1   
    14        1         NaN   
    19      NaN           1  
    

    And then get your zeros with newTable2 = fillmissing(newTable, 'constant', 0), which prints:

    A     B_table1    B_table2
    __    ________    ________
    
     1      170         240   
     2      120         200   
     3      100         180   
     4       40          60   
     5       30          50   
     6       20          40   
     7       10          30   
     8        0          20   
     9        8           8   
    10        2           2   
    11        1           0   
    12        0           1   
    14        1           0   
    19        0           1   
    

    UPDATE

    To combine multiple tables, you can either nest the outerjoin or write a function to loop over it (see similar Matlab forum question). Here's an example.

    Given data1 and data2 in OP, plus a new data3:

    data3 = [   
    1    2400
    2    2000
    20    1800
    ];
    

    Contents of myscript.m:

    table1 = MakeTable(data1);
    table2 = MakeTable(data2);
    table3 = MakeTable(data3);
    
    AllJoins = MultiOuterJoin(table1, table2, table3);
    
    
    % Functions
    
    function Table = MakeTable(Array)
        Table = array2table(Array);
        Table.Properties.VariableNames = {'A', 'B'}; % set your column names, e.g. {'freq', 'count'}
    end
    
    
    function Joined = MultiOuterJoin(varargin)
        Joined = varargin{1};
        Joined.Properties.VariableNames{end} = inputname(1); % set #2 column name to be based on table name
        for k = 2:nargin
          Joined = outerjoin(Joined, varargin{k}, 'LeftKeys', {'A'}, 'RightKeys', {'A'}, 'MergeKeys', true);
          name = inputname(k);
          Joined.Properties.VariableNames{end} = name; % set merged column name to be based on table name
        end
    end
    

    Which returns AllJoins:

    A     table1    table2    table3
    __    ______    ______    ______
    
     1     170       240       2400 
     2     120       200       2000 
     3     100       180        NaN 
     4      40        60        NaN 
     5      30        50        NaN 
     6      20        40        NaN 
     7      10        30        NaN 
     8       0        20        NaN 
     9       8         8        NaN 
    10       2         2        NaN 
    11       1         0        NaN 
    12       0         1        NaN 
    13       0         0        NaN 
    14       1         0        NaN 
    15       0         0        NaN 
    16       0         0        NaN 
    17       0         0        NaN 
    18       0         0        NaN 
    19       0         1        NaN 
    20     NaN       NaN       1800