I'm using a single DB table to store various different entities, which means I can't give it an arbitrary number of fields to cover every entity type. Instead I want to save just the most important fields (dates, reference IDs that act as a kind of foreign key to various other tables, the most important text fields, etc.) plus one additional text field where I'd like to store the more complete object data.
The most obvious solution would be to serialize objects to XML strings and store those. The second most obvious choice would be JSON, which is usually shorter and probably also faster to serialize/deserialize. But is it really? My objects also wouldn't need to be strictly serializable, because JsonSerializer can usually handle anything, even anonymous objects, which might well be what I use here.
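For example, a quick sketch of what I mean (this assumes System.Text.Json; Newtonsoft.Json behaves much the same):

```
using System;
using System.Text.Json;

class Demo
{
    static void Main()
    {
        // Anonymous object standing in for the "extra" entity data;
        // the shape is purely illustrative.
        var metadata = new
        {
            Author = "jdoe",
            Tags = new[] { "sphinx", "mysql" },
            Published = DateTime.UtcNow
        };

        // Serialize to a compact JSON string for the text column.
        string json = JsonSerializer.Serialize(metadata);
        Console.WriteLine(json);

        // Deserializing back into an anonymous type isn't supported;
        // you'd read it into a named DTO or a JsonDocument instead.
        using var doc = JsonDocument.Parse(json);
        Console.WriteLine(doc.RootElement.GetProperty("Author").GetString());
    }
}
```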
What would be the optimal solution to this problem?
My DB is highly normalised and I'm using Entity Framework, but to get super-fast external full-text search functionality I'm accepting a bit of denormalisation. For the record, I'm using SphinxSE on top of MySql. Sphinx would return row IDs that I would use to query my index-optimised conglomerate table, getting the most important data from it much faster than querying multiple tables all over my DB.
My table would have columns like:
RowID (auto increment)
EntityID (of the actual entity; not a direct foreign key, because it would have to point to different tables)
EntityType (so I would be able to get the actual entity if needed)
DateAdded (timestamp of when the record was added to this table)
Title
Metadata (serialized data related to the particular entity type)

This table would be indexed with the Sphinx indexer. When searching I would provide a set of EntityIDs and a limit date, and the indexer would have to return a very limited, paged set of RowIDs ordered by DateAdded (descending). I would then join these RowIDs back to my table and get the relevant results. So this won't actually be full-text search but a filtering search. Getting RowIDs would be very fast this way, and fetching results back from the table by RowID would be much faster than filtering on EntityID and DateAdded directly, even though those columns would be properly indexed.
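To illustrate, the conglomerate table would map to a single EF entity roughly like this (the names and types here are just a sketch, not a final schema):

```
using System;
using System.ComponentModel.DataAnnotations;

// One denormalised row per searchable entity; Metadata holds the
// serialized payload, everything else is what Sphinx filters/sorts on.
public class SearchRow
{
    [Key]
    public long RowID { get; set; }          // auto increment PK
    public long EntityID { get; set; }       // points into different tables depending on EntityType
    public int EntityType { get; set; }      // discriminator used to locate the real entity
    public DateTime DateAdded { get; set; }  // filtered/sorted on by the indexer
    public string Title { get; set; }
    public string Metadata { get; set; }     // serialized entity-specific data
}
```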
The only means of saving data in a SQL database that will not lead to long-term pain is to actually create a proper, normalized, indexed schema, and extend that schema as necessary when you add new properties to your domain objects.
Please do not attempt to "serialize" objects to a SQL database. If that is really what you want to do, you are better off using an object database such as db4o instead.
Update:
This is how I currently understand the problem space, based on comments and question updates:
My choices for this, in order of preference, would be:
Use the FTS engine's features.
Almost every FTS engine, including Sphinx, allows custom attributes to be stored as part of each "document." Presently you say you are only storing the Row ID so you can join to this table. Your results will come back a lot faster if you don't join at all, and instead keep this information inside the fulltext index itself. There are some pretty hard limits to what you can put in here, but if you can work around the limitation, it's your best option.
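As a rough sketch of what "no join at all" could look like, assuming EntityID, DateAdded and Title are declared as attributes in the index (the index and attribute names below are made up), you can speak SphinxQL to searchd over the MySQL protocol, port 9306 by default, straight from .NET:

```
using System;
using MySql.Data.MySqlClient;

class SphinxQuery
{
    static void Main()
    {
        // Connect to searchd's SphinxQL listener, not to MySQL itself
        // (Sphinx speaks the MySQL wire protocol but has no auth/SSL).
        using var conn = new MySqlConnection("Server=127.0.0.1;Port=9306;SslMode=none;");
        conn.Open();

        // Filter and sort purely on attributes stored inside the index;
        // every column we need comes back from Sphinx, so no SQL join.
        var sql = @"SELECT id, entity_id, date_added, title
                    FROM search_index
                    WHERE entity_id IN (12, 34, 56) AND date_added < 1700000000
                    ORDER BY date_added DESC
                    LIMIT 0, 20";

        using var cmd = new MySqlCommand(sql, conn);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader.GetInt64(0)}: {reader.GetString(3)}");
    }
}
```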
Document-Oriented Database.
You say you're not even really using the "Full-Text" part of Sphinx, you're just using it to optimize certain queries. Why not cut out the middle man then? You suggested JSON as a serialization format; MongoDB (to cite just one option) supports BSON natively. You can still create indexes on the common columns, but unlike mysql, it actually understands the BSON format, and is able to store that data a lot more efficiently than a JSON or XML string in a relational database. If you're denormalizing anyway, you have the freedom to choose any repository you want; choose the one that's best-optimized for your particular requirement.
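A minimal sketch with the official MongoDB .NET driver (the database, collection and field names are only illustrative); the document goes in as BSON, and the common fields can still be indexed:

```
using System;
using MongoDB.Bson;
using MongoDB.Driver;

class MongoSketch
{
    static void Main()
    {
        var client = new MongoClient("mongodb://localhost:27017");
        var entities = client.GetDatabase("search").GetCollection<BsonDocument>("entities");

        // One BSON document per row: the common fields plus the
        // entity-specific payload, with no separate serialized blob.
        entities.InsertOne(new BsonDocument
        {
            { "EntityID", 42L },
            { "EntityType", "Article" },
            { "DateAdded", DateTime.UtcNow },
            { "Title", "Example title" },
            { "Metadata", new BsonDocument { { "Author", "jdoe" }, { "WordCount", 1200 } } }
        });

        // Index only the fields you filter and sort on.
        entities.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(
            Builders<BsonDocument>.IndexKeys.Ascending("EntityID").Descending("DateAdded")));
    }
}
```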
Single-table inheritance.
This is a common design that trades off normalization for mapping simplicity. In your case, the entire objective is denormalization, so it's a good trade. This isn't a good option if there will be hundreds of columns, but for 10 or 20 this will be fine; it keeps your data as "data" and shouldn't impact performance in any significant way.
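Sketched with EF Core's table-per-hierarchy mapping (EF6 maps inheritance to a single table by default as well); the class and discriminator names are just examples:

```
using System;
using Microsoft.EntityFrameworkCore;

// All entries share one physical table; the discriminator column
// tells EF which CLR type a given row represents.
public abstract class SearchEntry
{
    public long RowID { get; set; }
    public DateTime DateAdded { get; set; }
    public string Title { get; set; }
}

public class ArticleEntry : SearchEntry
{
    public string Author { get; set; }
}

public class ProductEntry : SearchEntry
{
    public decimal Price { get; set; }
}

public class SearchContext : DbContext
{
    public DbSet<SearchEntry> Entries { get; set; }

    // Provider configuration (UseMySql etc.) omitted for brevity.
    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<SearchEntry>()
            .HasDiscriminator<string>("EntityType")
            .HasValue<ArticleEntry>("Article")
            .HasValue<ProductEntry>("Product");
    }
}
```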
XML columns.
The advantage to this approach is that the data is not opaque. It's actually meaningful in the context of the database. If you must store this information in the mysql database - presumably you expect some ad-hoc queries to be run - then you might as well store it in a format that mysql can actually understand. On the other hand, if you're 100% positive that you'll never need to "deserialize" this data until it hits your application, then I would probably go with...
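For instance, a sketch of serializing with XmlSerializer; the point is that MySQL can then answer ad-hoc questions against the stored string with functions like ExtractValue() (the column name and XPath below are assumptions):

```
using System;
using System.IO;
using System.Xml.Serialization;

public class ArticleMetadata
{
    public string Author { get; set; }
    public int WordCount { get; set; }
}

class XmlSketch
{
    static void Main()
    {
        var serializer = new XmlSerializer(typeof(ArticleMetadata));
        using var writer = new StringWriter();
        serializer.Serialize(writer, new ArticleMetadata { Author = "jdoe", WordCount = 1200 });

        string xml = writer.ToString();   // goes into the Metadata column
        Console.WriteLine(xml);

        // An ad-hoc query MySQL itself can answer against the stored XML, e.g.:
        //   SELECT ExtractValue(Metadata, '/ArticleMetadata/Author') FROM SearchRows;
    }
}
```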
Custom binary serialization format.
If you must store the data in your mysql database and you know for sure that you'll never need to index it or even read its contents from within a query, then don't waste precious I/O on bloated text encoding. Even JSON is bloated compared to binary, because JSON has to store all the property names; if you do your own serialization, you can use one or two bytes to determine the type and then deserialize the remaining fields/properties in a known order. Just the data, no metadata.
I wouldn't even use the .NET BinaryFormatter here; I would create my own heavily-optimized version. After all, this needs to be fast fast fast! and every extra byte that goes into your table makes the query slower. You might even be able to GZip-compress some of the data, depending on exactly what's in there.
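A rough sketch of what such a hand-rolled format could look like, with a one-byte type tag followed by the fields in a fixed order, optionally gzipped (the layout here is purely illustrative):

```
using System;
using System.IO;
using System.IO.Compression;

static class BinaryMetadata
{
    // Write a one-byte type tag, then the fields in a known, fixed order.
    public static byte[] SerializeArticle(string author, int wordCount, DateTime published)
    {
        using var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionLevel.Optimal, leaveOpen: true))
        using (var writer = new BinaryWriter(gzip))
        {
            writer.Write((byte)1);                 // 1 = "Article"
            writer.Write(author);
            writer.Write(wordCount);
            writer.Write(published.ToBinary());
        }
        return buffer.ToArray();
    }

    // Read the fields back in exactly the same order; no property names stored.
    public static (string Author, int WordCount, DateTime Published) DeserializeArticle(byte[] data)
    {
        using var gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress);
        using var reader = new BinaryReader(gzip);
        if (reader.ReadByte() != 1) throw new InvalidDataException("Not an Article record.");
        return (reader.ReadString(), reader.ReadInt32(), DateTime.FromBinary(reader.ReadInt64()));
    }
}
```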
Unless I still haven't fully understood your requirements, I wouldn't even consider any other options.