We would like to build a dictionary from the documentation of our company's products, in order to establish a fixed terminology, so we want to count the frequency of specific words and phrases.
This could be solved in a couple of different ways, but what we would ideally like is an XSLT algorithm that can recognize phrases as specific words that frequently occur together (so we don't have to specify beforehand every phrase and all its variants with different conjugations, affixations, etc.).
What do you think: could this task be done with XSLT, or should we look for other solutions?
If anyone has any useful advice on how we should start, I would be more than happy to hear your ideas and discuss them!
You're looking for collocations, which are commonly identified algorithmically using measures such as pointwise mutual information (PMI).
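For reference, PMI scores a pair of words by how much more often they co-occur than their individual frequencies would predict:

```latex
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```

A high PMI means the two words appear together far more often than chance, which is exactly the "words occurring together often" property you describe.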
XSLT has no framework for natural language processing (NLP), so you would have to invent one. However, NLP frameworks exist for general-purpose programming languages, such as Python's NLTK. Check out this example of finding collocations using Python.
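As a minimal sketch of what that might look like with NLTK (assumptions: NLTK is installed, the tokenizer data has been downloaded, and your documentation has been exported to plain text; the filename "docs.txt" and the frequency threshold are just placeholders):

```python
# Rank candidate two-word phrases in a plain-text export of the docs by PMI.
# Assumes `pip install nltk` and nltk.download('punkt') have been run.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

with open("docs.txt", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read().lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)            # ignore pairs seen fewer than 3 times
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 20))  # top 20 candidate phrases by PMI
```

To fold together the conjugated and affixed variants you mention, you could stem or lemmatize the tokens first (e.g. with nltk.stem.SnowballStemmer or WordNetLemmatizer) before handing them to the finder.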
It might be easiest to use an external app written in a popular data mining language like Python or R. (You could even plug it into your DITA-OT processing.) You might also look at vendors with existing solutions. I haven't done any in-depth search for that, but I've seen systems like Watson, Semaphore, and even XDocs return results from language analysis.