I am trying to develop a music-focused search engine for my final-year project. I have been doing some research on Latent Semantic Analysis (LSA, also known as Latent Semantic Indexing, LSI) and how it is used on the Internet. I am having trouble understanding exactly where LSA sits in the overall search engine pipeline. Should it be applied after a web crawler has finished fetching web pages?
I don't know much about music retrieval, but in text retrieval, LSA is only relevant if the search engine is making use of the vector space model of information retrieval. Most common search engines, such as Lucene, break each document up into words (tokens), remove stop words and put the rest of them into the index, each usually associated with a term weight indicating the importance of the term within the document.
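To make the tokenising step concrete, here is a minimal sketch in Python. It is not how Lucene actually analyses or scores text. The stop-word list, the `tokenize` and `document_vector` helpers, and the choice of relative term frequency as the weight are all assumptions made for illustration.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def document_vector(text):
    """Return {token: weight}, where weight is the relative term frequency.

    Real engines typically use more elaborate weights (e.g. TF-IDF variants);
    plain term frequency is an assumption that keeps the sketch short.
    """
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}

vec = document_vector("The history of jazz and the jazz guitar")
# vec maps each remaining token to its weight: "jazz" occurs twice out of
# four remaining tokens, so it gets the largest weight.
```

Each resulting dictionary is one document's list of (token, weight) pairs.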
Now the list of (token, weight) pairs can be viewed as a vector representing the document. If you combine all of these vectors into one huge term-document matrix and apply the LSA algorithm to it (in essence, a truncated singular value decomposition), you can use the result to transform the vectors of all documents before indexing them. In other words, LSA runs after crawling and tokenising, but before indexing.
Note that in the original vectors, the tokens represented the dimensions of the vector space. LSA will give you a new set of dimensions, and you'll have to index those (e.g. in the form of auto-generated integers) instead of the tokens.
Furthermore, you will have to transform the query into a vector of (token,weight) pairs, too, and then apply the LSA-based transformation to that vector as well.
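The whole pipeline described above (build the matrix, run LSA, transform the documents, then transform the query the same way) can be sketched with NumPy as follows. This is a toy illustration under stated assumptions, not a production design: the four tokens, the weights in `A`, the choice of `k = 2`, and the helper names `to_latent` and `cosine` are all made up for the example.

```python
import numpy as np

# Toy term-document matrix: rows are tokens, columns are documents.
# Tokens and weights are invented for illustration.
tokens = ["guitar", "jazz", "piano", "sonata"]
A = np.array([
    [1.0, 1.0, 0.0],   # guitar
    [1.0, 1.0, 0.0],   # jazz
    [0.0, 1.0, 1.0],   # piano
    [0.0, 0.0, 1.0],   # sonata
])

k = 2  # number of latent dimensions to keep (arbitrary choice here)

# LSA step: singular value decomposition, truncated to the top k components.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]

def to_latent(term_vector):
    """Map a token-space vector to the k-dimensional LSA space.

    Computes diag(s_k)^-1 @ U_k.T @ term_vector.
    """
    return (term_vector @ U_k) / s_k

# Transform every document (one column of A each) before indexing.
# The k columns of doc_vectors are the new dimensions, identified only
# by their integer position 0..k-1 rather than by a token.
doc_vectors = np.array([to_latent(A[:, j]) for j in range(A.shape[1])])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The query goes through the same steps: (token, weight) vector, then
# the same LSA-based transformation.
query = np.array([0.0, 1.0, 0.0, 0.0])   # the single token "jazz"
q_latent = to_latent(query)

scores = [cosine(q_latent, d) for d in doc_vectors]
ranking = np.argsort(scores)[::-1]   # document indices, best match first
```

In this toy data the third document does not contain "jazz" and ends up last in `ranking`. A real engine would store `doc_vectors` in its index instead of comparing against every document at query time.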
I am unsure whether anybody actually does all of this in a real-life text retrieval engine. One problem is that running the LSA algorithm on the matrix of all document vectors consumes a lot of time and memory. Another problem is handling updates, i.e. when a new document is added or an existing one changes. Ideally, you would rebuild the matrix, re-run LSA, transform all existing document vectors again, and regenerate the entire index. Not exactly scalable.