c# amazon-s3 lucene.net amazon-simpledb

Use Lucene or Amazon SimpleDB to store indexing metadata?


I am architecting a web solution that takes uploaded files and places them on S3. When uploading files, users can add metadata for indexing and archiving purposes. I had planned to use Lucene for this, as I have used it many times before, but I also noticed that Amazon SimpleDB offers an object metadata service for S3.

I am attracted to SimpleDB by the lack of maintenance and overhead on the machine serving the web application, and by its distributed nature compared with Lucene's single-location index file.

The requirement is that users need an AJAX search-as-you-type web interface. Lucene can provide this, but SimpleDB could also do it. What would I be gaining or losing by using SimpleDB indexing over Lucene in this limited-scope application?

Thanks for your knowledge.


Solution

  • I've used SimpleDB for something like this. The advantage, aside from zero-maintenance, is that SimpleDB scales, essentially indefinitely. This is really only an advantage if you want to architect for the possibility of very high traffic.

    The main drawbacks I see with SimpleDB for this are:

    • Higher latency. SimpleDB is designed for huge scalability and high availability. The tradeoff is that requests have a moderate latency - higher than you'd have with a 'local' non-distributed service like Lucene or with an RDBMS's text search features.

    • Less flexible text search. SimpleDB has an SQL-like query syntax, which supports the usual =, !=, >, < etc., and also LIKE, where the wildcard "%" can appear at the start of the string, at the end of the string, or both (e.g. "%keyword%"). There is no way to search with regexes or more complex patterns (beyond what you can do by combining the operators with AND/OR). Note: the LIKE condition previously only supported "%" at the end of the string - a limitation you may still see mentioned around the web, but one that no longer exists.
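
    To make the LIKE-based search concrete, here is a minimal C# sketch of how a search-as-you-type handler might build the Select expression. The class name, method name, and the "files"/"title" domain and attribute names are hypothetical and not from any SDK. The sketch only builds the query string, and it assumes SimpleDB's quoting rules: backticks around names, single quotes around values, with embedded quote characters doubled. Sending the expression to SimpleDB is left out.

```csharp
using System;

// Hypothetical helper (not part of any SDK): builds a SimpleDB Select
// expression for a search-as-you-type box using the LIKE operator.
public static class SimpleDbSearchQuery
{
    // Quote a domain or attribute name with backticks, doubling embedded backticks.
    private static string QuoteName(string name)
    {
        return "`" + name.Replace("`", "``") + "`";
    }

    // Quote a string value with single quotes, doubling embedded single quotes.
    private static string QuoteValue(string value)
    {
        return "'" + value.Replace("'", "''") + "'";
    }

    // Build a substring ("contains") query: "%" on both sides of the typed text.
    public static string BuildContainsQuery(
        string domain, string attribute, string typedText, int limit = 10)
    {
        if (string.IsNullOrWhiteSpace(typedText))
            throw new ArgumentException("Search text must not be empty.", nameof(typedText));
        if (limit < 1)
            throw new ArgumentOutOfRangeException(nameof(limit));

        // Assumption: a literal "%" typed by the user is passed through unchanged.
        string pattern = "%" + typedText.Trim() + "%";

        return "select * from " + QuoteName(domain)
             + " where " + QuoteName(attribute) + " like " + QuoteValue(pattern)
             + " limit " + limit;
    }
}
```

    For example, SimpleDbSearchQuery.BuildContainsQuery("files", "title", "report") produces a query matching any item whose title contains "report", capped at 10 results. Dropping the leading "%" from the pattern would turn it into a prefix match, which may suit autocomplete better.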

    SimpleDB also uses the 'eventual consistency' model by default: updates may take a little while (sometimes tens of seconds) to become consistently visible. That is a consequence of its scalability and can't be avoided under the default model. However, I doubt it will be an issue for your use case.