I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.
However, these podcasts could potentially run for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each individual play as its own row is unworkable (it would result in over a billion rows), and I want to be able to fetch this raw data fast.
So my question is: how do I best go about storing these massive amounts of timeline data?
One idea I had was to use a text/blob column and comma-separate the play counts: one value per second, in sequence, each value being the number of times that second has been played. So if there are 100,000 plays in second 1, 90,000 in second 2, and 95,000 in second 3, I would store it like this: "100000,90000,95000,[...]" in the text/blob column.
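For concreteness, a rough sketch of that idea as a table (the table and column names are just placeholders):

```sql
-- One row per podcast; the whole 3-hour timeline lives in one text column.
CREATE TABLE podcast_timelines (
    podcast_id  INT UNSIGNED NOT NULL PRIMARY KEY,
    play_counts TEXT NOT NULL  -- "100000,90000,95000,..."
);
```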
Is this a feasible way to store such data? Is there a better way?
Thanks!
Edit: the data is being tracked in a separate system, and I only need to update the raw graph data every 15 minutes or so. Hence, fast reads are the main concern.
Note: due to the nature of this project, each played second has to be tracked individually (in other words, I can't just track the 'start' and 'end' of each play).
The problem with blob storage is that you have to rewrite the entire blob for every change. That's not necessarily a bad thing. Using your format ("100000,90000,..."), that's roughly 7 bytes per second (six digits plus a comma), so 7 * 3600 * 3 = ~75K bytes. But it means rewriting that 75K blob for every play, for every second.
And, of course, the blob is opaque to SQL, so a query like "which second of which podcast has the most plays?" is impossible at the SQL level; answering it means fetching and parsing every blob, which is basically a table scan of all the data.
And there's a lot of parsing overhead marshalling that data in and out.
On the other hand: podcast ID (4 bytes) + second offset (a 2-byte unsigned value allows podcasts up to ~18 hours long) + play count (4 bytes) = 10 bytes per second. So, ignoring any per-row overhead, a 3-hour podcast is 3600 * 3 * 10 = 108K bytes.
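A minimal sketch of that layout as a schema (names are my own, and the UNSIGNED types shown are MySQL-flavored, so adjust for your DB server):

```sql
-- One row per (podcast, second) pair: 4 + 2 + 4 = 10 bytes of data per row,
-- plus whatever per-row overhead your server adds.
CREATE TABLE play_counts (
    podcast_id    INT UNSIGNED      NOT NULL,            -- 4 bytes
    second_offset SMALLINT UNSIGNED NOT NULL,            -- 2 bytes, up to ~18 hours
    play_count    INT UNSIGNED      NOT NULL DEFAULT 0,  -- 4 bytes
    PRIMARY KEY (podcast_id, second_offset)
);
```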
If you stored it as a binary blob instead of text (a packed block of 4-byte integers), that's 4 * 3600 * 3 = ~43K.
So the row-per-second structure is "only" about twice the size of a binary blob (in a perfect world; consult your DB server's documentation for actual row overhead). Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.
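For instance, the "which second of which podcast has the most plays" question from above becomes a one-liner against the hypothetical table sketched earlier:

```sql
-- Impossible against an opaque blob; trivial with row-per-second.
SELECT podcast_id, second_offset, play_count
FROM play_counts
ORDER BY play_count DESC
LIMIT 1;
```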
The only downside of row-per-second is if you need to do a lot of updates (several seconds at once for one podcast): that's a lot of UPDATE traffic to the DB, whereas with the blob method that's likely a single update.
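That update traffic can be batched, though. A sketch using MySQL's INSERT ... ON DUPLICATE KEY UPDATE (the upsert syntax varies by DB server) that folds one 15-minute batch of deltas for one podcast into the table in a single statement:

```sql
-- Values are made up for illustration: (podcast_id, second_offset, delta).
INSERT INTO play_counts (podcast_id, second_offset, play_count)
VALUES (42, 0, 137), (42, 1, 131), (42, 2, 129)
ON DUPLICATE KEY UPDATE play_count = play_count + VALUES(play_count);
```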
Your traffic patterns will influence that more than anything.