python  word-embedding  fasttext

Fasttext model representations for numbers


I would like to create a FastText model for numbers. Is this a good approach?

Use Case:

I have a given set of about 100,000 integer invoice numbers. Our OCR sometimes produces false invoice numbers like 1000o00 or 383I338, so my idea was to use FastText to predict the nearest invoice number based on my 100,000 integers. As the correct invoice numbers are known in advance, I trained a FastText model on all invoice numbers to create a word-embedding space containing just invoice numbers.

But it is not working, and I don't know if my idea is completely wrong. I would assume that even though I have no sentences, embedding into a vector space should work, and therefore the model should also find a similarity between 383I338 and 3831338.

Here is some of my code:

import pandas as pd
from random import seed
from random import randint
import fasttext
# seed random number generator
seed(9999)
number_of_vnr = 100000
min_vnr = 1111
max_vnr = 999999999

# generate vnr (Versicherungsscheinnummer / policy number) integers
versicherungsscheinnummern = [randint(min_vnr, max_vnr) for i in range(number_of_vnr)]

# save numbers as csv, one per line
df_vnr = pd.DataFrame(versicherungsscheinnummern, columns=['VNR'])
df_vnr['VNR'].dropna().astype(str).to_csv('vnr_str.csv', index=False)

# train an unsupervised cbow model with character n-grams of length 2-5
model = fasttext.train_unsupervised('vnr_str.csv', "cbow", minn=2, maxn=5)

Even numbers that are in the training data are not found:

model.get_nearest_neighbors("833803015")
[(0.10374893993139267, '</s>')]

The model has no words:

model.words
["'</s>'"]

Solution

  • I doubt FastText is the right approach for this.

    Unlike in natural languages, where word roots/prefixes/suffixes (character n-grams) can be hints to meaning, most invoice-number schemes are just incrementing numbers.

    Every '###' or '####' is going to have a similar frequency. (Well, perhaps there'd be a little bit of a bias towards lower digits to the left, for Benford's-Law-like reasons.) Unless the exact same invoice numbers repeat often throughout the corpus, so that the whole token, & its fragments, acquire a word-like meaning from surrounding other tokens, FastText's post-training nearest-neighbors are unlikely to offer any hints about correct numbers. (For it to have a chance to help, you'd want the same invoice numbers to not just repeat many times, but for a lot of those appearances to have similar OCR errors - but I strongly suspect your corpus instead has each invoice number appearing only in individual texts.)

    Is the real goal to correct the invoice numbers, or just to have them be less noisy in a model that's trained on a lot of more meaningful, text-like tokens? (If the latter, it might be better just to discard anything that looks like an invoice number – with or without OCR glitches – or is similarly so rare it's likely an OCR scanno; see the filtering sketch at the end of this answer.)

    That said, statistical & edit-distance methods could potentially help if the real need is correcting OCR errors - just not semantic-context-dependent methods like FastText. You might get useful ideas from Peter Norvig's classic writeup on "How to Write a Spelling Corrector".