Tags: url, cluster-analysis, similarity

Clustering a huge number of URLs


I need to find similar URLs, such as

'http://teethwhitening360.com/teeth-whitening-treatments/18/'
'http://teethwhitening360.com/laser-teeth-whitening/22/'
'http://teethwhitening360.com/teeth-whitening-products/21/'
'http://unwanted-hair-removal.blogspot.com/2008/03/breakthroughs-in-unwanted-hair-remo'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-by-shaving.ht'

and gather them into groups (clusters). My problems:

  • The number of URLs is large (1,580,000)
  • I don't know which clustering algorithm or similarity measure would work best

I would appreciate any suggestion on this.


Solution

  • There are a few problems at play here. First, you'll probably want to clean up the URLs by splitting them into words with a dictionary, for example to convert

    http://teethwhitening360.com/teeth-whitening-treatments/18/

    to

    teeth whitening 360 com teeth whitening treatments 18

    Then you may want to stem the words, e.g. using the Porter stemmer:

    teeth whiten 360 com teeth whiten treatment 18

    Then you can use a simple vector space model to map the URLs into an n-dimensional space (one dimension per distinct stem) and run k-means clustering on the resulting vectors. It's a basic approach, but it should work.

    The number of URLs involved shouldn't be a problem, though that depends on the language/environment you're using. I would think Matlab would be able to handle it.
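
The steps above (dictionary split, stemming, vector space model, k-means) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a tuned implementation, and several parts are assumptions: `WORDS` is a hypothetical toy dictionary, `crude_stem` is a two-suffix stand-in for a real Porter stemmer, and the clustering is a cosine-similarity variant of k-means with deterministic farthest-first seeding rather than the usual random start.

```python
import math
import re
from collections import Counter

# Hypothetical toy dictionary; a real run would load a full word list.
WORDS = {"teeth", "whitening", "treatments", "products", "laser",
         "unwanted", "hair", "removal", "shaving", "by", "in"}


def segment(run, words=WORDS):
    """Split a run of letters into the fewest dictionary words, or keep it whole."""
    best = {0: []}  # best[i] = fewest-word split of run[:i]
    for end in range(1, len(run) + 1):
        for start in range(end):
            if start in best and run[start:end] in words:
                candidate = best[start] + [run[start:end]]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(run), [run])


def crude_stem(word):
    """Stand-in for a real Porter stemmer: strips only 'ing' and plural 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(url):
    """Drop the scheme, split into letter and digit runs, then segment and stem."""
    body = re.sub(r"^[a-z]+://", "", url.lower())
    tokens = []
    for run in re.findall(r"[a-z]+|[0-9]+", body):
        pieces = segment(run) if run.isalpha() else [run]
        tokens.extend(crude_stem(p) for p in pieces)
    return tokens


def normalize(weights):
    """Scale a sparse vector (a dict of term weights) to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=10):
    """Cosine-similarity k-means with deterministic farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector least similar to the seeds chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    total.update(v)  # Counter.update adds the weights
            if total:
                centroids[c] = normalize(total)
    return labels


urls = [
    "http://teethwhitening360.com/teeth-whitening-treatments/18/",
    "http://teethwhitening360.com/laser-teeth-whitening/22/",
    "http://teethwhitening360.com/teeth-whitening-products/21/",
    "http://unwanted-hair-removal.blogspot.com/2008/03/breakthroughs-in-unwanted-hair-remo",
    "http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html",
    "http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-by-shaving.ht",
]
vectors = [normalize(Counter(tokenize(u))) for u in urls]
labels = kmeans(vectors, k=2)
for label, url in zip(labels, urls):
    print(label, url)
```

On the six sample URLs from the question, this puts the three teethwhitening360.com pages in one cluster and the three blogspot pages in the other. For the full 1,580,000 URLs, pure-Python loops like these would likely be slow; the same pipeline is probably better expressed with sparse matrices and a mini-batch k-means (scikit-learn offers both, for instance), and the number of clusters `k` would have to be chosen by experiment.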