database hash nosql amazon-web-services datastore

Should I hash English words used for lookups in a NoSQL datastore?


I am building a search engine. I am using a NoSQL key-value datastore, specifically Amazon SimpleDB, rather than a regular RDBMS. I have a table of URLs that point to web pages. I think I need to build another table that can be used to look up which pages contain a given English word.

The structure of this table is Search (String word, String URL), and my queries would look like select * from Search where word = "foo"

Should I hash the words before storing them and before looking them up? That is, should I use a table Search (String word_hash, String URL) and queries like select * from Search where word_hash = "acbd18db4cc2f85cedef654fccc4a4d8"
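To make the hashed variant concrete, here is a minimal sketch in Python. It assumes the hash is MD5 (the example value above is the MD5 hex digest of "foo") and uses an in-memory dict as a stand-in for the Search table. The names word_hash, add_posting and lookup are made up for illustration, and lowercasing the word before hashing is my own assumption, not something the datastore does.

```python
import hashlib

def word_hash(word: str) -> str:
    # Assumption: normalise to lower case so "Foo" and "foo" share one key.
    return hashlib.md5(word.lower().encode("utf-8")).hexdigest()

# In-memory stand-in for the Search (word_hash, URL) table.
search_table: dict[str, set[str]] = {}

def add_posting(word: str, url: str) -> None:
    # Store the URL under the hashed word instead of the word itself.
    search_table.setdefault(word_hash(word), set()).add(url)

def lookup(word: str) -> set[str]:
    # The query word must be hashed the same way before the lookup.
    return search_table.get(word_hash(word), set())

add_posting("foo", "http://example.com/a.html")
add_posting("Foo", "http://example.com/b.html")
print(word_hash("foo"))
print(sorted(lookup("foo")))
```

Note that every lookup pays for one extra hash computation, and only exact-match queries are possible on the hashed column.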


Solution

  • The jury is still out on the general case. It seems likely that the database hashes keys internally, which would make hashing them yourself redundant, but there is an important counter-example: the Bigtable paper lists it as a specific benefit that URL keys such as "com.example.foo/*.html" cluster together, which makes it easier to build the Google search index. Hashing the keys yourself would destroy that clustering. (See the Bigtable paper for details.)
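The clustering point can be illustrated with a small sketch, assuming a store that keeps keys in sorted order, as Bigtable does for row keys. The helper names row_key and prefix_scan and the sample URLs are hypothetical, and the "\uffff" upper bound assumes keys contain no character above it.

```python
import bisect
import hashlib

def row_key(url: str) -> str:
    # Reverse the host name so pages from one domain share a key prefix.
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

def prefix_scan(sorted_keys: list[str], prefix: str) -> list[str]:
    # Range scan over sorted keys: everything that starts with `prefix`.
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

urls = [
    "foo.example.com/a.html",
    "other.org/index.html",
    "foo.example.com/b.html",
    "bar.example.com/c.html",
]

plain_keys = sorted(row_key(u) for u in urls)
hashed_keys = sorted(hashlib.md5(row_key(u).encode("utf-8")).hexdigest() for u in urls)

# With plain keys, one contiguous range holds all pages of a host.
print(prefix_scan(plain_keys, "com.example.foo/"))
# With hashed keys, the prefix no longer exists, so the scan finds nothing.
print(prefix_scan(hashed_keys, "com.example.foo/"))
```

For a word index that only ever does exact-match lookups, this ordering benefit matters less, but it is lost entirely once the keys are hashed.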