I'm reading in a csv file that is about 80MB - data_O3
. It's about 250,000 x 5 in size. I created E
, which is a little bit larger because it has all the days (data_O3
is missing some days). I want to compare the two so that if the date (saved in variable d3
) and siteID (d4
) are the same, the data point (column 5) is placed in E
.
for j = 1:size(data_O3,1)
E(strcmp(d3,data_O3{j,3})&d4 == data_O3{j,4},5) = data_O3(j,5);
end
This script works fine, but for some reason, running it takes longer than expected. I've run the same code for other data that were only slightly smaller with no problem. Is this an issue with the strcmp
code or something else?
The script and files used can be found here: https://www.dropbox.com/sh/7bzq3m1ixfeuhu6/i4oOvxHPkn
There are certainly see a number of ways to speed this up significantly.
First of all, read in all numeric data in as numbers. Matlab is not optimized to work with strings, and even cells should generally be avoided as much as possible. If you want to keep everything as strings, use another language (python or perl)
Once you have the state, county and site read in as numbers, then create a number instead of a string for the siteID. One approach would be to use the formula:
siteID = siteNum + 1e4*countyCode + 1e7*stateCode
That would generate unique siteIDs for all sites.
Use datenum to convert the date field into a number.
You are now in a position where the data_O3
defined on line 79 can be a purely numeric array (no cells!), as can your E
matrix. That alone will make the process many times faster.
You also might want to define the E
as something other than NaN. Maybe give it values of -1.
There may be more optimizations you can do in the comparison, but do the above first and I expect you will see a huge improvement.