Basically, I want to calculate the "proximity" of various terms. By "proximity" I mean specifically the number of spaces/characters/words that sit between them.
Example:
Terms = Word1 / Word2
Chunk = "blah Word1 blah blah blah blah blah Word2 blah"
Proximity = Word1-Word2: 5
The script would see the two terms, locate them, and then work out the distance based on the words that lie between them.
A more advanced version would be to examine the semantic structure - and identify whether the terms occur within the same semantic element, or a sibling, or a parent etc. Thus proximity discovery of terms may be within the same paragraph, or in sequential paragraphs, or under the same "parent" (heading) but otherwise separate etc.
Further - introducing things like word stemming/relationships/soundings at a later date may be useful too.
I've looked around the net (Google, here, PHP forums, PHP script sites) and am not seeing anything like it. I can see tools on some sites that do something similar in a limited way, usually SEO-based tools. I want to be able to apply this to "text" in general, as I may apply it to uploaded Word/txt files etc.
I'm not seeing any real examples, so I can only assume it's more than a trifle to code.
The question is - how can I do this? How would I handle variant order of the words (Word1+Word2 / Word2+Word1)? How could I handle identifying proximity within/outside of the same element/structure?
Hoping someone can shed some light/make some suggestions.
If you need to do a lot of this kind of search on a given text, you could begin by indexing the whole text into a database containing the word, its position in the text, and the paragraph number (if needed). Then, you could select all the Word1 and Word2 positions, and it shouldn't be too hard to infer the minimal distance.
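To make the indexing idea concrete, here is a rough sketch in Python (used only for illustration; the same logic should port to PHP with arrays). It uses an in-memory dictionary instead of a real database, and the names `build_index` and `min_proximity` are made up for this example. It assumes paragraphs are separated by blank lines, that "words" are runs of letters, digits, or underscores, and that the two terms are different words.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lowercased word to a list of (word_position, paragraph_number)."""
    index = defaultdict(list)
    position = 0
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for paragraph_number, paragraph in enumerate(paragraphs):
        for word in re.findall(r"\w+", paragraph):
            index[word.lower()].append((position, paragraph_number))
            position += 1
    return index


def min_proximity(index, term_a, term_b):
    """Return (words_between, same_paragraph) for the closest pair, or None.

    Assumes term_a and term_b are different words. The order of the
    arguments does not matter, because the distance is absolute.
    """
    a = index.get(term_a.lower(), [])
    b = index.get(term_b.lower(), [])
    best = None
    i = j = 0
    # Both lists are sorted by position, so advancing whichever pointer
    # is behind visits every candidate for the closest pair.
    while i < len(a) and j < len(b):
        (pos_a, par_a), (pos_b, par_b) = a[i], b[j]
        between = abs(pos_a - pos_b) - 1
        if best is None or between < best[0]:
            best = (between, par_a == par_b)
        if pos_a < pos_b:
            i += 1
        else:
            j += 1
    return best


index = build_index("blah Word1 blah blah blah blah blah Word2 blah")
print(min_proximity(index, "Word1", "Word2"))  # (5, True)
print(min_proximity(index, "Word2", "Word1"))  # (5, True)
```

Because the distance is taken as an absolute value, Word1+Word2 and Word2+Word1 give the same result. The paragraph number stored next to each position is what lets you say whether the closest pair sits in the same paragraph; a heading or section number could be stored the same way if you need the "same parent" case.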
Edit: Here is an attempt at a simple one-shot algorithm that does not use a database.