For a project I am working on, I want to identify abbreviations the first time they are introduced in a text.
For example:
He was working for the Danish National Bank (DNB).
(...)
The DNB was a great employer.
Should match DNB as an abbreviation for Danish National Bank. Not all abbreviations are capitals though:
In 2012 the Law equal treatment of Circus Workers (after this: LetCW) was introduced.
Which should extract LetCW. What is the best approach to do this? I am currently thinking about removing "after this" and then taking the same number of words before the brackets as there are letters in the suspected abbreviation.
EDIT: Another interesting case is the abbreviation of a single word, e.g.:
Abbreviation (Abbr)
or
Abbreviation (Abvn)
This is an NLP problem, and it does not strike me as a regex problem; a regex does not appear to be the most appropriate tool.
It seems that you want to parse a token stream and identify promising tokens that are potentially abbreviations. They may, for example, be delimited by parentheses or commas. Annoyingly, they may appear immediately before or after a definition phrase, once stopwords ("the", "i.e.", "after this") have been deleted. One heuristic for identifying potential abbreviations would be a case-sensitive lookup showing that the token is not in an English-language dictionary.
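As a rough illustration of the candidate-spotting step, here is a minimal Python sketch. It only handles the parenthesis-delimited case, and everything named in it is an assumption made for the example: the tiny ENGLISH_WORDS set stands in for a real dictionary, STOPWORDS is a guess at what you would strip, and candidate_abbreviations is a hypothetical helper, not part of any library. The regex is used only to split the text into tokens, not to find the abbreviations themselves.

```python
import re

# Assumption: a tiny stand-in for a real English word list.
ENGLISH_WORDS = {
    "he", "was", "working", "for", "the", "danish", "national", "bank",
    "after", "this", "a", "great", "employer",
}
# Assumption: words to drop before judging a token.
STOPWORDS = {"the", "after", "this"}


def tokenize(text):
    """Split text into alphanumeric words and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def candidate_abbreviations(text, dictionary=ENGLISH_WORDS, stopwords=STOPWORDS):
    """Return tokens inside parentheses that are not dictionary words."""
    candidates = []
    depth = 0  # how many parentheses are currently open
    for tok in tokenize(text):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth = max(0, depth - 1)
        elif depth > 0 and tok.isalnum():
            if tok.lower() in stopwords:
                continue  # skip "after this" and similar filler
            if tok not in dictionary:  # case-sensitive lookup
                candidates.append(tok)
    return candidates


print(candidate_abbreviations(
    "He was working for the Danish National Bank (DNB)."))
print(candidate_abbreviations(
    "In 2012 the Law equal treatment of Circus Workers "
    "(after this: LetCW) was introduced."))
```

With a real dictionary you would probably also want to handle comma-delimited candidates and abbreviations that appear before their definition, which this sketch ignores.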
Having identified a potential abbreviation token, you'll want to scan its immediate neighborhood to see if you can explain it in terms of nearby words, ideally using just their initial letters. For a truly challenging dataset, you might try explaining DARPA backronyms.
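The neighborhood scan could look something like the following Python sketch, which walks backwards from the candidate and tries to match its letters against the initials of the preceding words. The function name explain_by_initials and the SKIPPABLE set of function words are assumptions for illustration; the matching is greedy and case-insensitive, and it only looks at words before the candidate.

```python
# Assumption: short function words that may be left out of an abbreviation.
SKIPPABLE = {"of", "the", "and", "for", "in"}


def explain_by_initials(words_before, abbreviation, skippable=SKIPPABLE):
    """Try to explain `abbreviation` using the initials of preceding words.

    `words_before` is a list of non-empty words that precede the candidate.
    Returns the matching phrase, or None if no explanation is found.
    """
    letters = abbreviation.lower()
    pos = len(letters) - 1        # next abbreviation letter to explain
    i = len(words_before) - 1     # walk backwards through the words
    start = None                  # index of the first word of the phrase
    while i >= 0 and pos >= 0:
        word = words_before[i]
        if word[0].lower() == letters[pos]:
            pos -= 1
            start = i
        elif word.lower() not in skippable:
            return None           # a content word that does not fit
        i -= 1
    if pos >= 0:
        return None               # ran out of words before all letters matched
    return " ".join(words_before[start:])


print(explain_by_initials(
    "He was working for the Danish National Bank".split(), "DNB"))
print(explain_by_initials(
    "In 2012 the Law equal treatment of Circus Workers".split(), "LetCW"))
```

This handles both examples from the question, but not the single-word case such as Abbreviation (Abbr); for that you would need something like a check that the abbreviation's letters occur in order within one word.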
To take this in a different direction, you might try applying word2vec. Here it would be phrase2vec, and the challenge would be to scalably identify multi-word phrases with a very small cosine distance to potential abbreviation tokens.