Sphinx: Real-Time Search w/Expiration?

I am designing a search that will be fed around 50 to 200 GB of text data per day (similar to logs) and it only needs to retain that data for week or two. This data will be piped at a constant rate (5,000/per second for example), non-stop, 24 hours a day. After a week or two, the document should drop out of the index never to be heard from again.

The index should be searchable with free-form text across only 1 field (pretty small in size, around 512 characters max). At most, the schema could have 2 attributes that could be categorized.

The system needs to be indexed in near real-time as data is fed to it. A delay of 15 to 30 seconds is acceptable.

We prefer to stream data into the indexer/service with a constant stream of pipe data.

Lastly, a single stand-alone solution is prefer over any type of distribution setup (this will be part of a package to deploy and setup on local machines for testers).

I'm looking closely at Sphinx search engine with RT updates via the API as it checks off most of these. But, I am not seeing an easy way to expire documents after a certain length of time.

I am aware that I could track the IDs and a timestamp and issue a batch DELETE through the Sphinx API. But, that creates an issue of tracking large amounts of IDs in a separate datastore that will need the same kind of 5,000/per second inserts and deleting them when done.

I also have a concern around Sphinx Fragmentation of mass-inserting, and mass-deleting in the middle of inserting.

We would really prefer the search engine/indexer to handle the expiration itself.

I think I can perform a WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEEKS-AGO as the where clause in the Sphinx API in order to gather the Document IDs to delete. The problem with that is if the system does not stay ontop of the deletes, the total number of documents/search results will be in the 10s of millions, maybe even billions in count after a two week timeframe if it has to gather a few days worth of document ids to delete. That's not a feasible query.

Solution

You can actually run

 DELETE FROM rt WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEEKS-AGO

As a query to delete the old documents, which is much simpler :)

You will also need to call OPTIMIZE INDEX from time to time.

Both these will have to be called on some sort of 'cron' schedule, as they wont be run automatically.

You might be better not using Sphinxes DELETE function at all. When writing RT indexes, as soon as the RAM chunk is full its writen out as a disk chunk. So you end up with a number of disk chunks on the disk. The oldest documents will be in the oldest chunk, sequentially.

So to clear out the oldest documents, you could just dispose of the oldest chunks. (on a rolling basis)

The problem is sphinx does not include a function to delete individual chunks.

Will need to shutdown searchd, delete the chunk(s), manipulate the header files and then restart Sphinx. Not an easy process.

But in the more general sense, not sure if sphinx will be able to keep up with a continuous stream of 5,000/documents per second (even ignoreing delete for a moment) - Sphinx is generally designed for write-infrequently, read-frequently. It builds a (for the most part) monolithic inverted index. This is great for querying, but is very hard to keep updated. Its not great for incremental updates.