
Correct POS tags for numbers substituted with ## in spacy


The gigaword dataset is a huge corpus used to train abstractive summarization models. It contains summaries like these:

spain 's colonial posts #.## billion euro loss
taiwan shares close down #.## percent

I want to process these summaries with spacy and get the correct POS tag for each token. The issue is that all numbers in the dataset were replaced with # signs, which spacy does not classify as numbers (NUM) but as other tags.

>>> import spacy
>>> from spacy.tokens import Doc
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.tokenizer = lambda raw: Doc(nlp.vocab, words=raw.split(' '))
>>> text = "spain 's colonial posts #.## billion euro loss"
>>> doc = nlp(text)
>>> [(token.text, token.pos_) for token in doc]
[('spain', 'PROPN'), ("'s", 'PART'), ('colonial', 'ADJ'), ('posts', 'NOUN'), ('#.##', 'PROPN'), ('billion', 'NUM'), ('euro', 'PROPN'), ('loss', 'NOUN')]

Is there a way to customize the POS tagger so that it classifies all tokens consisting only of # signs and dots as numbers?

I know you can replace the spacy POS tagger with your own, or fine-tune it for your domain with additional data, but I don't have tagged training data in which all numbers are replaced with #, and I would like to change the tagger as little as possible. I would prefer a regular expression or a fixed list of tokens that are always recognized as numbers.
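For reference, the kind of token check meant here could be sketched like this (the function name and regex are illustrative, not part of spaCy):

```python
import re

# Hypothetical check: does a token consist only of # signs and dots,
# in the shape of a masked number like "#", "##" or "#.##"?
MASKED_NUM = re.compile(r'#+(?:[.,]#+)*')

def looks_like_masked_number(token_text: str) -> bool:
    return MASKED_NUM.fullmatch(token_text) is not None

print(looks_like_masked_number('#.##'))   # True
print(looks_like_masked_number('posts'))  # False
```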


Solution

  • What about replacing # with a digit?

    In a first version of this answer I chose the digit 9, because it reminds me of the COBOL numeric field formats I used some 30 years ago... But then I had a look at the dataset, and realized that for proper NLP processing one should get at least a couple of things straight:

    • ordinal numerals (1st, 2nd, ...)
    • dates

    Ordinal numerals need special handling for any choice of digit, but the digit 1 produces reasonable dates, except for the year (of course, 1111 may or may not be interpreted as a valid year, but let's play it safe). 11/11/2020 is clearly better than 99/99/9999...
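    The date point can be illustrated with a plain string replacement (the masked date below is a made-up example, not taken from the dataset):

    ```python
    masked = '##/##/####'

    # Choosing 9 yields an impossible date; choosing 1 yields a plausible one,
    # apart from the year, which the substitutions below map 1111 -> 2020.
    print(masked.replace('#', '9'))  # 99/99/9999
    print(masked.replace('#', '1'))  # 11/11/1111
    ```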

    Here is the code:

    import re
    
    ic = re.IGNORECASE
    subs = [
        (re.compile(r'\b1(nd)\b', flags=ic), r'2\1'),  # 1nd -> 2nd
        (re.compile(r'\b1(rd)\b', flags=ic), r'3\1'),  # 1rd -> 3rd
        (re.compile(r'\b1(th)\b', flags=ic), r'4\1'),  # 1th -> 4th
        (re.compile(r'11(st)\b', flags=ic), r'21\1'),  # ...11st -> ...21st
        (re.compile(r'11(nd)\b', flags=ic), r'22\1'),  # ...11nd -> ...22nd
        (re.compile(r'11(rd)\b', flags=ic), r'23\1'),  # ...11rd -> ...23rd
        (re.compile(r'\b1111\b'), '2020')              # 1111 -> 2020
    ]
    
    text = '''spain 's colonial posts #.## billion euro loss
    #nd, #rd, #th, ##st, ##nd, ##RD, ##TH, ###st, ###nd, ###rd, ###th.
    ID=#nd#### year=#### OK'''
    
    text = text.replace('#', '1')
    for pattern, repl in subs:
        text = re.sub(pattern, repl, text)
    
    print(text)
    # spain 's colonial posts 1.11 billion euro loss
    # 2nd, 3rd, 4th, 21st, 22nd, 23RD, 11TH, 121st, 122nd, 123rd, 111th.
    # ID=1nd1111 year=2020 OK
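    Wrapped into a reusable helper (a straightforward refactor of the snippet above; the function name is just a suggestion):

    ```python
    import re

    ic = re.IGNORECASE
    SUBS = [
        (re.compile(r'\b1(nd)\b', flags=ic), r'2\1'),  # 1nd -> 2nd
        (re.compile(r'\b1(rd)\b', flags=ic), r'3\1'),  # 1rd -> 3rd
        (re.compile(r'\b1(th)\b', flags=ic), r'4\1'),  # 1th -> 4th
        (re.compile(r'11(st)\b', flags=ic), r'21\1'),  # ...11st -> ...21st
        (re.compile(r'11(nd)\b', flags=ic), r'22\1'),  # ...11nd -> ...22nd
        (re.compile(r'11(rd)\b', flags=ic), r'23\1'),  # ...11rd -> ...23rd
        (re.compile(r'\b1111\b'), '2020'),             # 1111 -> 2020
    ]

    def unmask_numbers(text: str) -> str:
        """Replace # masks with digits so a POS tagger sees number-like tokens."""
        text = text.replace('#', '1')
        for pattern, repl in SUBS:
            text = pattern.sub(repl, text)
        return text

    print(unmask_numbers('#.## billion'))  # 1.11 billion
    print(unmask_numbers('#rd place in ####'))  # 3rd place in 2020
    ```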
    

    If the preprocessing of the corpus converted every digit into a # anyway, you lose no information with this transformation. Some “true” # would become a 1, but this would probably be a minor problem compared to numbers not being recognized as such. Furthermore, in a visual inspection of about 500,000 lines of the dataset I haven't been able to find any candidate for a “true” #.

    N.B.: The \b in the above regular expressions stands for “word boundary”, i.e., the boundary between a \w (word) and a \W (non-word) character, where a word character is any alphanumeric character (see the Python re documentation). The \1 in the replacement stands for the first group, i.e., the first pair of parentheses. Using \1 preserves the original casing, which a literal replacement string like 2nd would not. I later found that your dataset is normalized to all lower case, but I decided to keep the code generic.

    If you need to get the #-masked text back from the tagged tokens, it's simply

    token.text.replace('0','#').replace('1','#').replace('2','#').replace('3','#').replace('4','#')
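    Equivalently, since every digit in the converted text came either from a # or from the ordinal/year fix-ups, a single re.sub call restores the mask (the function name is illustrative):

    ```python
    import re

    def remask(text: str) -> str:
        """Turn every digit back into # (e.g. to compare with the raw corpus)."""
        return re.sub(r'\d', '#', text)

    print(remask('1.11'))  # #.##
    print(remask('2020'))  # ####
    ```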