So!
I am working in PHP and have a huge list of taxonomy/tags, say around 100,000.
A similar list of tags can be found in the wealth of tags listed under products at Zazzle.com.
I am attempting to programmatically organize this list into a tiered menu of sorts based on the relationship between words, similar strings, and specificity.
I have toyed around with the levenshtein function, similar_text, searching for substrings, using the Princeton WordNet database, etc., and just can't crack this nut. Essentially, I am trying to build an ontology out of this database that goes from very general to very specific in tiers. It doesn't have to be perfect, but I have run out of simple key phrases to search for and of ideas for doing this programmatically while still keeping some semblance of order.
For instance, if I use substring matching, I might end up with Dog->Dogma, Dogra, etc.
If I use levenshtein or similar_text, I might end up with Bog, Log, Cog, and Dog all very closely related.
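To make that failure mode concrete, here is a minimal sketch that scores a few tags against "dog" using only the string functions mentioned above. The helper name `surfaceScores` and the sample tags are made up for illustration. The point is that "bog" is a single edit away from "dog", while the semantically related "canine" shares no letters with it at all.

```php
<?php
// Hypothetical helper: scores each tag against the query using only
// surface-level string functions (no knowledge of meaning).
function surfaceScores(string $query, array $tags): array
{
    $scores = [];
    foreach ($tags as $tag) {
        // similar_text() writes a 0-100 similarity percentage into $percent.
        similar_text($query, $tag, $percent);
        $scores[$tag] = [
            'lev'    => levenshtein($query, $tag),   // edit distance, lower is closer
            'sim'    => round($percent, 1),
            'prefix' => strpos($tag, $query) === 0,  // true when the tag starts with the query
        ];
    }
    return $scores;
}

// Sample tags are assumptions for this sketch.
foreach (surfaceScores('dog', ['bog', 'log', 'cog', 'dogma', 'canine']) as $tag => $s) {
    printf("%-8s lev=%d sim=%.1f%% prefix=%s\n", $tag, $s['lev'], $s['sim'], $s['prefix'] ? 'yes' : 'no');
}
```

Any ranking built purely on these numbers will put "bog" ahead of "canine", which is why string distance alone probably cannot produce the tiers.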
This database, or taxonomy if you will, is also constantly changing, and thus at least part of the analysis has to be done on the fly. The good news is that only one level of the result needs to be available. For instance, the near results of a query such as Dog might be small dog, large dog, red dog, blue dog, canine, etc.
I know this is a terrible question, but does anyone have a ray of light on at least what steps I should take, any useful functions I could use, queries to research, methodologies, etc.?
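One way the single-level lookup could be sketched is to match on whole words rather than on substrings, so that "small dog" qualifies but "Dogma" does not. The function name `nearTags` and the sample list are hypothetical, and this is only a starting point: it will not find synonyms such as "canine", which would need something like WordNet or co-occurrence data.

```php
<?php
/**
 * Hypothetical helper: return the tags that contain $query as a whole word
 * (case-insensitive), excluding the query tag itself.
 */
function nearTags(string $query, array $tags): array
{
    $needle  = strtolower(trim($query));
    $matches = [];
    foreach ($tags as $tag) {
        $normalized = strtolower(trim($tag));
        if ($normalized === $needle) {
            continue; // skip the query tag itself
        }
        // Split on whitespace so "dogma" is not treated as containing "dog".
        $words = preg_split('/\s+/', $normalized, -1, PREG_SPLIT_NO_EMPTY);
        if (in_array($needle, $words, true)) {
            $matches[] = $tag;
        }
    }
    return $matches;
}

// Sample tags are assumptions for this sketch.
$tags = ['Dog', 'small dog', 'large dog', 'red dog', 'Dogma', 'Dogra', 'Bog'];
print_r(nearTags('Dog', $tags)); // small dog, large dog, red dog
```

With around 100,000 tags, scanning the whole list per query may be too slow; a precomputed word-to-tags index that is updated as tags change would be the usual next step.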
Thank you for your time.
So far, I have two suggestions for programmatically organizing tags into an ontology.
Find co-occurrences of tags to organize them into groups. I believe the idea is that if tags occur together, they are probably related.
Use algorithmic stemming to reduce multiple forms/derivations of a word to a common stem. This should reduce the number of tags the script needs to sift through, and may also help identify similar tags that share the same stem.
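The two suggestions can be combined in a rough sketch like the one below. It assumes each product carries its own list of tags, which is a guess about how the data is stored. `crudeStem`, `coOccurrences`, and `relatedTo` are hypothetical names, and `crudeStem` is a deliberately naive suffix stripper, not the Porter algorithm; a real stemmer would replace it.

```php
<?php
/**
 * Very crude suffix stripper, for illustration only. It is NOT a real
 * stemming algorithm and will mangle many words.
 */
function crudeStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'es', 's'] as $suffix) {
        $len = strlen($suffix);
        // Only strip when at least three characters would remain.
        if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

/**
 * Count how often each pair of stemmed tags appears on the same product.
 * $products is assumed to be a list of tag lists, one per product.
 */
function coOccurrences(array $products): array
{
    $counts = [];
    foreach ($products as $tags) {
        // Stem first, then de-duplicate so "dog" and "dogs" count once per product.
        $stems = array_values(array_unique(array_map('crudeStem', $tags)));
        foreach ($stems as $a) {
            foreach ($stems as $b) {
                if ($a !== $b) {
                    $counts[$a][$b] = ($counts[$a][$b] ?? 0) + 1;
                }
            }
        }
    }
    return $counts;
}

/** Return the stems that most often co-occur with $tag, highest count first. */
function relatedTo(string $tag, array $counts, int $limit = 5): array
{
    $row = $counts[crudeStem($tag)] ?? [];
    arsort($row);
    return array_slice($row, 0, $limit, true);
}

// Made-up sample data.
$products = [
    ['dogs', 'leash', 'pets'],
    ['dog', 'puppy', 'pets'],
    ['dog', 'leashes'],
    ['cat', 'pets'],
];
print_r(relatedTo('dogs', coOccurrences($products)));
```

Because the counts are simple increments, they can be updated as products are added, which fits the requirement that part of the analysis happens on the fly.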
If you have whole sentences, or at least more than just single words, available, you might want to have a look at latent semantic analysis (LSA).
Don't be scared by the math; once you get the basic idea behind it, it's fairly simple.
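In rough outline (a standard textbook summary, with symbols chosen here for illustration): build a term-document matrix $X$ whose entry $x_{ij}$ is the weight of term $i$ in document $j$, take its singular value decomposition, and keep only the $k$ largest singular values:

$$
X = U \Sigma V^{\top}, \qquad X_k = U_k \Sigma_k V_k^{\top}
$$

Each term $i$ is then represented by row $t_i$ of $U_k \Sigma_k$, and two terms can be compared by the cosine of the angle between their rows:

$$
\operatorname{sim}(i, j) = \frac{t_i \cdot t_j}{\lVert t_i \rVert \, \lVert t_j \rVert}
$$

Terms that appear in similar contexts end up close together in this reduced space even if they never share a document.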
An excellent read on this and other IR topics is the free eBook Introduction to Information Retrieval.
Have a look at the book; it's very well written and helped me a lot with my IR thesis.