I have a bunch of domains that I would like to explode into words. I downloaded a word list from wordlist.sourceforge.net and started writing a brute-force script that runs each domain through the dictionary.
The problem is that I can't get it to produce good enough results. The simple script I wrote looks like this:
    $wd = [];
    foreach ($domains as $dom) {
        foreach ($words as $w) {
            // stristr() returns the rest of $dom from the match onward, or false if $w is absent
            if (stristr($dom, $w) !== false) {
                $wd[$dom][] = $w;
            }
        }
    }
$words is the dictionary array and $domains is just an array of domain names.
The results look like this:
    [aheadsoftware] => Array
        (
            [0] => ahead
            [1] => head
            [2] => heads
            [3] => soft
            [4] => software
            [5] => ware
        )
Technically it works, but what I don't know how to code is getting the script to understand that once it has matched 'ahead', 'head' and 'heads' are no longer available. It should also pick 'software' instead of 'soft' and 'ware'. Yes, I know, the world of linguistic computing is pure pain ;)
A naive solution: every time you have a match, and before you add the word to the results, do another stristr lookup to see whether the word you are about to add is contained in any of the words already in there. If it is, don't add it.
This would not work if, for example, the domain contains 'heads' and your dictionary lists 'head' first. You would probably rather have 'heads' in the results instead of 'head'.
You can get around that limitation by checking which one is longer. If the word already in your results is longer, do not add the new word. If the new word is longer, remove the one already in the results and add the new one.
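Here is a rough sketch of that idea, not a finished solution. The function names addMatch and explodeDomain are made up for illustration, and it uses stripos instead of stristr because only a yes/no answer is needed. Since a word that contains another is always at least as long, the containment check doubles as the length check.

```php
<?php
// Hypothetical helper: add $word to $kept unless a kept word already
// contains it; drop any kept words that $word itself contains.
function addMatch(array $kept, string $word): array
{
    foreach ($kept as $i => $existing) {
        if (stripos($existing, $word) !== false) {
            // An equal or longer word already covers the new one: skip it.
            return $kept;
        }
        if (stripos($word, $existing) !== false) {
            // The new word is longer and contains the old one: remove the old one.
            unset($kept[$i]);
        }
    }
    $kept[] = $word;
    return array_values($kept); // reindex after any unset()
}

// Hypothetical wrapper: run one domain through the whole dictionary.
function explodeDomain(string $domain, array $words): array
{
    $kept = [];
    foreach ($words as $w) {
        if ($w !== '' && stripos($domain, $w) !== false) {
            $kept = addMatch($kept, $w);
        }
    }
    return $kept;
}

$words = ['ahead', 'head', 'heads', 'soft', 'software', 'ware'];
print_r(explodeDomain('aheadsoftware', $words));
```

For the sample domain this keeps 'ahead', 'heads' and 'software'. Note that 'heads' survives: it overlaps 'ahead' in the domain but neither word contains the other, so the containment check alone does not catch overlapping matches.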