Challenging Regular Expression for Abbreviations

For a project I am working on, I want to identify abbreviations the first time they are introduced in a text.

For example:

He was working for the Danish National Bank (DNB).

(...)

The DNB was a great employer.

This should match DNB as an abbreviation for Danish National Bank. Not all abbreviations are all capitals, though:

In 2012 the Law equal treatment of Circus Workers (after this: LetCW) was introduced.

This should extract LetCW. What is the best approach to do this? I am currently thinking about removing "after this" and then taking the same number of words before the brackets as there are letters in the suspected abbreviation.

EDIT: Another interesting case is the abbreviation of a single word, e.g.:

Abbreviation (Abbr)

Abbreviation (Abvn)

Solution

This is an NLP problem, and it does not strike me as a regex problem: a regular expression does not appear to be the most appropriate tool.

It seems that you want to parse a token stream and identify promising tokens that are potentially abbreviations. They may, for example, be delimited by parentheses or by commas. Annoyingly, they may appear immediately before or immediately after a definition phrase, once stopwords ("the", "i.e.", "after this") have been deleted. One heuristic for identifying potential abbreviations would be a case-sensitive lookup: a token that is not found in an English-language dictionary is a candidate.
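As a rough illustration of that first step, here is a minimal sketch that pulls single tokens out of parentheses, strips a lead-in phrase, and keeps the tokens that fail a case-sensitive dictionary lookup. The names ENGLISH_WORDS, LEAD_INS and candidate_abbreviations are made up for this example, and the tiny word set is only a stand-in for a real dictionary.

```python
import re

# Assumption: a tiny stand-in for a real English word list.
ENGLISH_WORDS = {"he", "was", "the", "bank", "great", "employer", "law", "see", "below"}

# Assumption: lead-in phrases to drop from inside the parentheses.
LEAD_INS = ("after this:", "hereafter:", "i.e.")

PAREN = re.compile(r"\(([^()]+)\)")

def candidate_abbreviations(text):
    """Return (token, offset of the opening parenthesis) for each candidate."""
    candidates = []
    for m in PAREN.finditer(text):
        inner = m.group(1).strip()
        for lead in LEAD_INS:
            if inner.lower().startswith(lead):
                inner = inner[len(lead):].strip()
        # Keep a single alphabetic token that is not a dictionary word
        # in its exact casing (the lookup is deliberately case-sensitive).
        if re.fullmatch(r"[A-Za-z]{2,10}", inner) and inner not in ENGLISH_WORDS:
            candidates.append((inner, m.start()))
    return candidates

text = ("He was working for the Danish National Bank (DNB). "
        "In 2012 the Law equal treatment of Circus Workers (after this: LetCW) "
        "was introduced. It was great (see below).")
print([token for token, _ in candidate_abbreviations(text)])
# prints ['DNB', 'LetCW']
```

Only the parenthesis case is handled here; comma-delimited abbreviations would need their own pattern.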

Having identified a potential abbreviation token, you'll want to scan its immediate neighborhood to see if you can explain it in terms of nearby words, ideally using just their initial letters. For a truly challenging dataset, you might try explaining DARPA backronyms.
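The neighborhood scan could look something like the sketch below. It relaxes the "same number of words as letters" idea from the question: it tries windows of words ending just before the abbreviation and accepts the shortest one whose initials spell it, letting lowercase words such as "of" be skipped. The function names and the max_extra parameter are hypothetical, and this is only one of many possible matching rules.

```python
import re

def initials_match(abbr, words):
    """True if the initials of `words`, in order, spell `abbr` (ignoring case).

    Lowercase words such as "of" may be skipped; capitalised words may not.
    """
    if not words:
        return not abbr
    head, rest = words[0], words[1:]
    # Option 1: this word supplies the next letter of the abbreviation.
    if abbr and head[0].lower() == abbr[0].lower() and initials_match(abbr[1:], rest):
        return True
    # Option 2: skip this word, which is allowed only if it is lowercase.
    return head[0].islower() and initials_match(abbr, rest)

def explain(abbr, preceding_text, max_extra=2):
    """Return the shortest trailing phrase whose initials explain `abbr`, or None."""
    words = re.findall(r"[A-Za-z]+", preceding_text)
    for n in range(len(abbr), len(abbr) + max_extra + 1):
        if n > len(words):
            break
        window = words[-n:]
        # The phrase has to start with the abbreviation's first letter.
        if window[0][0].lower() == abbr[0].lower() and initials_match(abbr, window):
            return " ".join(window)
    return None

print(explain("DNB", "He was working for the Danish National Bank"))
# prints Danish National Bank
print(explain("LetCW", "In 2012 the Law equal treatment of Circus Workers"))
# prints Law equal treatment of Circus Workers
print(explain("Abbr", "Abbreviation"))
# prints None
```

The last call shows the limit of an initials-only rule: single-word cases like Abbreviation (Abbr) need a different check, for example testing whether the letters form a subsequence of the word.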

To take this in a different direction, you might try applying word2vec. Here it would be phrase2vec, and the challenge would be to identify, in a way that scales, multi-word phrases with a very small cosine distance to potential abbreviation tokens.
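The ranking step of that idea can be sketched without any particular embedding library. The vectors below are invented placeholder values, not the output of a trained model; in practice they would come from embeddings learned on your corpus. The helper names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_phrase(abbr_vector, phrase_vectors):
    """Return the phrase whose vector has the smallest cosine distance."""
    return min(phrase_vectors,
               key=lambda phrase: cosine_distance(abbr_vector, phrase_vectors[phrase]))

# Placeholder vectors, made up purely for illustration.
dnb = [0.90, 0.10, 0.30]
phrases = {
    "Danish National Bank": [0.88, 0.12, 0.31],
    "great employer": [0.10, 0.90, 0.20],
}
print(nearest_phrase(dnb, phrases))
# prints Danish National Bank
```

A linear scan like this does not scale to large phrase inventories, which is exactly the difficulty mentioned above; an approximate nearest-neighbour index would be the usual workaround.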