
An Odd Tag Organization Script


So!

I am working in PHP and have a huge list of taxonomy/tags, say around 100,000.

A similar list of tags can be found in the wealth of tags listed under products at Zazzle.com.

I am attempting to programmatically organize this list into a tiered menu of sorts based on the relationship between words, similar strings, and specificity.

I have toyed around with levenshtein(), similar_text(), substring searches, the Princeton WordNet database, etc., and just can't crack this nut. Essentially, I am trying to build an ontology out of this database that goes from very general to very specific in tiers. It doesn't have to be perfect, but I have run out of simple key phrases to search for, and of ideas for doing this programmatically while still keeping some semblance of order.

For instance: if I use substring matching, I might end up with Dog -> Dogma, Dogra, etc.

If I use levenshtein() or similar_text(), I might end up with Bog, Log, Cog, and Dog all very closely related.
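
To make the two failure modes concrete, here is a minimal sketch in Python (standard library only). The tag list is a made-up sample, the `levenshtein` helper is a hand-written edit distance standing in for PHP's built-in levenshtein(), and `difflib.SequenceMatcher` is used only as a rough analogue of similar_text(); the two are not guaranteed to produce the same scores.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


# Hypothetical sample of tags, not taken from any real data set.
tags = ["dog", "dogma", "dogra", "bog", "log", "cog", "small dog", "canine"]

# Prefix matching pulls in words that merely start with "dog".
prefix_hits = [t for t in tags if t != "dog" and t.startswith("dog")]

# Edit distance 1 pulls in words that only look like "dog".
near_hits = [t for t in tags if levenshtein("dog", t) == 1]

# A character-based ratio scores a true synonym at zero,
# because "dog" and "canine" share no letters.
ratio = SequenceMatcher(None, "dog", "canine").ratio()

print(prefix_hits, near_hits, ratio)
```

Both measures look only at characters, so neither can tell that "canine" belongs next to "dog" while "dogma" does not.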

This database (or taxonomy, if you will) is also constantly changing, so at least part of the analysis has to be done on the fly. The good news is that only one level of the result needs to be available. For instance, the near results of a query such as Dog might be small dog, large dog, red dog, blue dog, canine, etc.

I know this is a terrible question, but can anyone offer a ray of light on what steps I should take, useful functions I could use, queries to research, methodologies, etc.?

Thank you for your time.


So far, I have two suggestions for programmatically organizing tags into an ontology:

  1. Find co-occurrences of tags to organize them into groups. The idea, I believe, is that tags which occur together are probably related.

  2. Use algorithmic stemming to reduce multiple forms and derivations of a word to a common stem. This should reduce the number of tags the script needs to sift through, and may also identify similar tags based on their shared stem.
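
The two suggestions above can be combined. The following is a minimal Python sketch, not a tuned implementation: `crude_stem` is a deliberately naive suffix stripper standing in for a real stemming algorithm such as Porter's, `related_tags` is a hypothetical helper name, and the `products` list is invented sample data in which each entry is the tag list of one item.

```python
from collections import Counter
from itertools import combinations


def crude_stem(tag: str) -> str:
    """Naive suffix stripping; a placeholder for a real stemmer."""
    tag = tag.lower().strip()
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("s", "")):
        if tag.endswith(suffix) and not tag.endswith("ss"):
            stem = tag[: -len(suffix)] + replacement
            if len(stem) >= 3:  # avoid reducing short words to nothing
                return stem
    return tag


def related_tags(products, tag, top_n=5):
    """Rank stems by how often they share an item with the given tag."""
    pair_counts = Counter()
    for tag_list in products:
        # Stem first so "dogs", "Dog", and "dog" count as one tag.
        stems = sorted({crude_stem(t) for t in tag_list})
        pair_counts.update(combinations(stems, 2))

    target = crude_stem(tag)
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == target:
            scores[b] += count
        elif b == target:
            scores[a] += count
    return scores.most_common(top_n)


# Hypothetical sample: one tag list per product.
products = [
    ["dogs", "puppies", "leash"],
    ["dog", "leash", "collar"],
    ["Dog", "puppy"],
    ["cat", "collar"],
]

print(related_tags(products, "dogs"))
```

This sketch recounts every pair on each call, which is fine for illustration. With around 100,000 tags that change over time, the pair counts would more plausibly be kept in a table and updated incrementally, so that a query only reads the one level of neighbours it needs.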


Solution

  • If you have whole sentences, or at least more than just single words, available, you might want to look into latent semantic analysis (LSA).

    Don't be scared by the math. Once you have the basic idea behind it, it's fairly simple:

    • create a (high-dimensional) term-document matrix from your data
    • essential step: reduce this huge sparse matrix to a lower-dimensional space using singular value decomposition (SVD)
    • every collection of tags/terms can then be represented by a vector in the lower-dimensional model
    • the cosine similarity between two such vectors is a good measure of the similarity of your tags, even if they do not share a stem (you may find dog and barking related)
    • good input for the term-document matrix is vital

    An excellent read on this and other IR topics (free eBook): Introduction to Information Retrieval.
    Have a look at the book; it's very well written and helped me a lot with my IR thesis.
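
The LSA steps listed above can be sketched in a few lines of dependency-free Python. Several things here are assumptions made for illustration: each "document" is simply the tag list of one item, the matrix uses binary weights, all function names are hypothetical, and instead of calling a library SVD the sketch runs power iteration on the term-term matrix A·Aᵀ, whose top eigenvectors are the left singular vectors of A. In practice a numerical library would handle that step.

```python
import math


def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are 0/1."""
    terms = sorted({t for doc in docs for t in doc})
    return terms, [[1.0 if t in doc else 0.0 for doc in docs] for t in terms]


def top_eigenpairs(sym, k, iterations=200):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration + deflation."""
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for idx in range(k):
        v = [1.0 / (i + 1 + idx) for i in range(n)]  # fixed start vector
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(a * b for a, b in zip(v, mv))
        pairs.append((eigenvalue, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def lsa_term_vectors(docs, k=2):
    """Map each term to a k-dimensional vector (singular vector * singular value)."""
    terms, a = term_document_matrix(docs)
    n = len(terms)
    gram = [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(n)]
            for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    return {term: [math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs]
            for i, term in enumerate(terms)}


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical sample: "barking" and "leash" never share a document.
docs = [
    ["dog", "barking"], ["dog", "leash"], ["dog", "barking"], ["dog", "leash"],
    ["teapot", "kettle"], ["teapot", "cup"],
]
vectors = lsa_term_vectors(docs, k=2)
barking_leash = cosine(vectors["barking"], vectors["leash"])
dog_teapot = cosine(vectors["dog"], vectors["teapot"])
print(round(barking_leash, 3), round(dog_teapot, 3))
```

In this toy data "barking" and "leash" never appear together, yet both appear with "dog", so in the reduced space they end up pointing the same way while "dog" and "teapot" stay nearly orthogonal. That indirect association is what plain co-occurrence counting or string similarity would miss.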