I am collecting a large amount of data, which will most likely be in a format as follows:
User 1: (a,o,x,y,z,t,h,u)
All the variables change dynamically with respect to time, except u, which stores the user name. My background in "big data" is not very deep, and what I am trying to understand is this: my array will end up very large, something like 108000 x 3500, and I will be performing analysis on each timestep and graphing it, so what would be an appropriate database to manage it in? Since this is for scientific research I was looking at CDF and HDF5, and based on what I read from NASA I think I will want to use CDF. But is this the correct way to manage such data for speed and efficiency?
The final data set will have all the users as columns, and the rows will be timestamped, so my analysis program would read row by row to interpret the data and make entries into the dataset. Maybe I should be looking at things like CouchDB or an RDBMS; I just don't know a good place to start. Advice would be appreciated.
This is an extended comment rather than a comprehensive answer ...
With respect, a dataset of size 108000*3500 doesn't really qualify as big data these days, not unless you've omitted a unit such as GB. Even at 8 bytes per value, 108000*3500 elements is only 3GB plus change. Any of the technologies you mention will cope with that with ease. I think you ought to make your choice on the basis of which approach will speed your development rather than speeding your execution.
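To make the size estimate concrete, here is a minimal sketch of the arithmetic. The row and column counts come from the question; the 8 bytes per value is an assumption (a double-precision float per cell), and the function name is purely illustrative.

```python
# Back-of-the-envelope size of a dense 108000 x 3500 array.
ROWS = 108_000        # timesteps (from the question)
COLS = 3_500          # columns (from the question)
BYTES_PER_VALUE = 8   # assumption: each cell is a 64-bit float


def dataset_size_bytes(rows: int, cols: int, bytes_per_value: int) -> int:
    """Return the raw size in bytes of a dense rows x cols array."""
    return rows * cols * bytes_per_value


size = dataset_size_bytes(ROWS, COLS, BYTES_PER_VALUE)
print(f"{size:,} bytes")        # 3,024,000,000 bytes
print(f"{size / 1e9:.2f} GB")   # 3.02 GB
```

If the values were stored as 4-byte floats instead, the same calculation gives roughly half that, so either way the whole array is within reach of a single ordinary machine.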
But if you want further suggestions to consider, I suggest:
all of which have some traction in the academic big data community and are beginning to be used outside that community too.