algorithm, data-mining, classification, levenshtein-distance, text-mining

URL path similarity/string similarity algorithm


I need to compare URL paths and decide whether they are similar. Below is example data to process:

# GROUP 1
/robots.txt

# GROUP 2
/bot.html

# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php

# GROUP 4
//phpMyAdmin/

I tried using Levenshtein distance for the comparison, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think 90% or above is a must.
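
For reference, the kind of comparison meant here can be sketched as follows. This is only an illustration: the function names `levenshtein` and `similarity`, and the choice to normalize by the length of the longer path, are assumptions and not the exact code that was tried.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts,
    deletes and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for
    strings that share nothing, scaled by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(similarity("/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
                 "/phpMyAdmin-2.5.6-rc2/scripts/setup.php"))
print(similarity("/robots.txt", "/bot.html"))
```

A pairwise score like this still leaves open how to turn scores into groups, which is where a single fixed cutoff tends to fall short when paths vary a lot in length.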

I think I need some sort of classifier, but the problem is that each new batch of data can contain paths that should be assigned to a new, previously unknown class.

Could you please point me in the right direction?

Thanks


Solution

  • While checking @jakub.gieryluk's suggestion, I accidentally found a solution that satisfies me: the "Hobohm clustering algorithm, originally devised to reduce redundancy of biological sequence data sets."

    Tests of the Perl library implemented by Bruno Vecchi gave me really good results. The only problem is that I need a Python implementation, but I believe I can either find one on the Internet or reimplement the code myself.

    One more thing: I have not yet checked the active learning ability of this algorithm ;)
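
Since a Python version is needed, here is a minimal sketch of a greedy, threshold-based clustering in the spirit of the Hobohm approach, as far as I understand it: walk through the paths once, attach each path to the first cluster whose representative is similar enough, and otherwise start a new cluster. It is not a port of the Perl library. The names `path_similarity` and `greedy_cluster`, the use of `difflib.SequenceMatcher` as the similarity measure, and the threshold of 0.75 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def path_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher
    (an assumed stand-in for whatever measure the real library uses)."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(paths, threshold=0.75):
    """Single-pass greedy clustering.

    Each cluster is a (representative, members) pair, where the
    representative is the first path that opened the cluster.
    """
    clusters = []
    for path in paths:
        for representative, members in clusters:
            if path_similarity(representative, path) >= threshold:
                members.append(path)
                break
        else:
            # No existing representative is close enough: open a new cluster.
            clusters.append((path, [path]))
    return clusters


paths = [
    "/robots.txt",
    "/bot.html",
    "/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
    "/phpMyAdmin-2.5.6-rc2/scripts/setup.php",
    "/phpMyAdmin-2.5.6/scripts/setup.php",
    "/phpMyAdmin-2.5.7-pl1/scripts/setup.php",
    "/phpMyAdmin-2.5.7/scripts/setup.php",
    "/phpMyAdmin-2.6.0-alpha/scripts/setup.php",
    "/phpMyAdmin-2.6.0-alpha2/scripts/setup.php",
    "//phpMyAdmin/",
]

for representative, members in greedy_cluster(paths):
    print(representative, len(members))
```

Because a path that matches no representative simply opens a new cluster, previously unseen classes are handled without retraining. The result depends on the input order and on the threshold, so both would need tuning on real data.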