I have an unbalanced panel of city-year level data. After padding missing values with NaNs, I import the data as columns into matlab. For example:
city year population area
nyc 2000 140 22
nyc 2001 151 22
nyc 2002 NaN 22.5
nyc 2003 NaN 22.5
boston 2000 NaN 10
boston 2001 75 11
boston 2002 76 10.5
boston 2003 77 10
chicago 2000 43 34
chicago 2001 42 31
chicago 2002 43 NaN
chicago 2003 45 34
I can reshape each column to a matrix, in which one column represent one city. For example:
mPopulation = reshape(population,[4,3])
gives
140 NaN 43
151 75 42
NaN 76 43
NaN 77 45
Similarly, mArea = reshape(area,[4,3])
gives
22 10 34
22 11 34
22.5 10.5 NaN
22.5 10 34
Note that (1) mPopulation
and mArea
have NaN
in different positions. (2) Within a matrix, different cities have different number of missing values.
Later analysis cannot be performed on missing values, so I think I need to reshape each column to a cell, as it allows for columns with different number of rows.
Therefore, the questions are (1) how to convert the matrices into cells? (2) how to removing missing values for the same city across different cells, so each cell has the same number of rows for the same city?
The final results we want to get are two cells cPopulation
and cArea
:
140 75 43
151 76 42
77 45
and
22 11 34
22 10.5 34
10 34
You can see that, since population is missing for nyc-2002 and nyc-2003, we also removed the area data for nyc in these two years. After removing missing entries in this way, nyc only has 2 rows of data left, while boston and chicago have 3 rows of data. So the final data can't be stored in matrices...
EDIT
Another way to do this is, removing missing values from the main matrix first:
city year population area
nyc 2000 140 22
nyc 2001 151 22
boston 2001 75 11
boston 2002 76 10.5
boston 2003 77 10
chicago 2000 43 34
chicago 2001 42 31
chicago 2003 45 34
However, it's hard to convert each column into a cell. Since different cities have different number of rows already, we cannot use the reshape
command, i.e. cannot tell matlab it's 4-by-3 for each city...
Before answering I should say that @cris-luengo comment is a much better idea and I would also suggest to handle NaN
s rather than using cell arrays.
However, one approach is:
data = readtable('path\to\data');
cleanedData = data(~any(isnan(data{:, {'population', 'area'}}), 2), :);
[uniqueCities, ~, cityId] = unique(cleanedData.city, 'stable');
cPopulation = arrayfun(@(x) cleanedData.population(cityId == x), 1:length(uniqueCities), 'uni', 0);
cArea = arrayfun(@(x) cleanedData.area(cityId == x), 1:length(uniqueCities), 'uni', 0);
Here we first create a new copy of the data which has rows removed if either population
or area
are NaN
.
Next we use the third output of the unique
function to create a vector identifying the location of the unique cities in the table.
Finally, using the arrayfun
function we extract the required data, specifying that the output should be a cell array by passing the name-value pair uni
and false