Clustering a huge number of URLs

I have to find similar URLs like

'http://teethwhitening360.com/teeth-whitening-treatments/18/'
'http://teethwhitening360.com/laser-teeth-whitening/22/'
'http://teethwhitening360.com/teeth-whitening-products/21/'
'http://unwanted-hair-removal.blogspot.com/2008/03/breakthroughs-in-unwanted-hair-remo'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-by-shaving.ht'

and gather them in groups or clusters. My problems:

The number of URLs is large (1,580,000)
I don't know which clustering algorithm or similarity measure would work best

I would appreciate any suggestion on this.

Solution

There are a few problems at play here. First, you'll probably want to clean up the URLs by splitting them into words with a dictionary, for example to convert

http://teethwhitening360.com/teeth-whitening-treatments/18/

into

teeth whitening 360 com teeth whitening treatments 18
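
As a rough sketch of that cleaning step, something like the following could work. It assumes a word list is available; the tiny WORDS set and the helper names segment and url_to_tokens are made up for illustration, and a real run would load a full dictionary.

```python
import re
from urllib.parse import urlparse

# Tiny illustrative vocabulary (an assumption); a real run would load a full word list.
WORDS = {"teeth", "whitening", "treatments", "laser", "products",
         "unwanted", "hair", "removal"}

def segment(chunk, words=WORDS):
    """Split a run of letters into dictionary words, preferring the fewest words.
    Returns the chunk unchanged (as a one-item list) if no full split exists."""
    n = len(chunk)
    best = [None] * (n + 1)  # best[i] = word list covering chunk[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and chunk[j:i] in words:
                candidate = best[j] + [chunk[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n] if best[n] is not None else [chunk]

def url_to_tokens(url):
    """Drop the scheme, split host and path on non-alphanumerics,
    separate letters from digits, then segment the letter runs."""
    parsed = urlparse(url.lower())
    text = parsed.netloc + parsed.path
    tokens = []
    for chunk in re.findall(r"[a-z]+|[0-9]+", text):
        tokens.extend(segment(chunk) if chunk.isalpha() else [chunk])
    return tokens

tokens = url_to_tokens("http://teethwhitening360.com/teeth-whitening-treatments/18/")
print(" ".join(tokens))
# teeth whitening 360 com teeth whitening treatments 18
```

Chunks that cannot be fully split into dictionary words (such as "com" here) are kept whole, so nothing is lost if the dictionary is incomplete.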

Then you may want to stem the words somehow, e.g. using the Porter stemmer:

teeth whiten 360 com teeth whiten treatment 18
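
To show the shape of the stemming step without pulling in a library, here is a deliberately crude suffix stripper. It is not the Porter algorithm, just a stand-in that happens to give the same result on this example; the name crude_stem is made up, and in practice you would swap in a real Porter implementation (NLTK's PorterStemmer is one option).

```python
def crude_stem(word):
    """Strip a few common English suffixes.
    A rough stand-in for illustration only, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short tokens like "18" or "com" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "teeth whitening 360 com teeth whitening treatments 18".split()
stemmed = [crude_stem(t) for t in tokens]
print(" ".join(stemmed))
# teeth whiten 360 com teeth whiten treatment 18
```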

Then you can use a simple vector space model to map the URLs into an n-dimensional space and run k-means clustering on them. It's a basic approach, but it should work.
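
A minimal pure-Python sketch of the vector space model plus k-means might look like this. The token lists are hand-written and only loosely based on the example URLs, the function names are made up, and the deterministic farthest-point seeding is my own choice to keep the example reproducible; k-means is more commonly started from random points.

```python
from collections import Counter

def vectorize(docs):
    """Map each token list to a term-count vector over the shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vectors, vocab

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(vectors, k):
    """Farthest-point seeding (an assumption): start from the first vector, then
    repeatedly add the vector farthest from the centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(vectors, k, max_iter=50):
    """Lloyd's algorithm; returns one cluster label per vector."""
    centroids = init_centroids(vectors, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Illustrative stemmed token lists (hand-written for this sketch).
docs = [
    "teeth whiten 360 com teeth whiten treatment 18".split(),
    "teeth whiten 360 com laser teeth whiten 22".split(),
    "teeth whiten 360 com teeth whiten product 21".split(),
    "unwant hair removal blogspot com 2008 03 unwant hair removal product html".split(),
    "unwant hair removal blogspot com 2008 03 unwant hair removal by shav".split(),
]
vectors, vocab = vectorize(docs)
labels = kmeans(vectors, k=2)
print(labels)
# [0, 0, 0, 1, 1]
```

Dense Python lists like these will not scale to 1.58 million URLs. At that size you would probably want sparse vectors and a library implementation; scikit-learn's TfidfVectorizer and MiniBatchKMeans are one possible combination. You also have to pick k up front, which is a limitation of k-means itself.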

The number of URLs involved shouldn't be a problem, though that depends on the language/environment you're using. I would think Matlab would be able to handle it.