Tags: machine-learning, nlp, data-science, word2vec, word-embedding

Using a Word2Vec Model to Extract Data


I've used gensim Word2Vec to learn the embedding of monetary amounts and other numeric data in bank transaction memos. The goal is to use this to be able to extract these amounts and currencies from future input strings.

Design

Our input strings are something like

"AMAZON.COM TXNw98e7r3347 USD 49.00 @ 1.283"

During preprocessing, I tokenize the memo and replace every token that could be a monetary amount (a string consisting only of digits, commas, and at most one decimal point/period) with a special VALUE_TOKEN. I also manually replace exchange rates with RATE_TOKEN. The result would be

["AMAZON", ".COM", "TXNw", "98", "e", "7", "r", "3347", "USD", "VALUE_TOKEN", "@", "RATE_TOKEN"]

With all my preprocessed token lists collected in a list data, I generate the model

from gensim.models import Word2Vec
model = Word2Vec(data, window=3, min_count=3)

The embeddings of the model that I'm most interested in are those of VALUE_TOKEN, RATE_TOKEN, and any currencies (USD, EUR, CAD, etc.). Now that I've generated the model, I'm not sure what to do with it.

Problem

Say I have a new string that the model has never seen before,

new_string = "EUR 299.99 RATE 1.3289 WITH FEE 5.00"

I would like to use the model to identify which tokens of new_string are most contextually similar to VALUE_TOKEN (which should return ["299.99", "5.00"]) and which is closest to RATE_TOKEN ("1.3289"). It should be able to classify these based on the learned embeddings. I can preprocess new_string the way I do the training data, but because I don't know the exchange rate beforehand, all three tokens ["299.99", "5.00", "1.3289"] will be tagged the same (either with VALUE_TOKEN or a new UNIDENTIFIED_TOKEN).

I've looked into methods like most_similar and similarity, but I don't think they work for tokens that are not in the vocabulary. What methods should I use to do this? Is this the right approach?


Solution

  • Word2vec's fuzzy, dense embedded token representations don't strike me as the right tool for what you're doing, though they might be an indirect contributor to a hybrid approach.

    In particular:

    • The word2vec algorithm originated with, & has its most consistent public results on, natural-language texts, with their particular patterns of relative token frequencies and varied co-occurrences. Certainly, many have applied it, with success, to other kinds of text/record data, but such uses may require a lot more preprocessing/parameter-tuning, and to the extent the underlying data has some fixed, highly repetitive scheme, it might be more amenable to other approaches.
    • If you replace all known values with 'VALUE_TOKEN', & all known rates with 'RATE_TOKEN', then the model is only going to learn token-vectors for 'VALUE_TOKEN' & 'RATE_TOKEN'. Such a model won't be able to supply any vector for non-replaced tokens it's never seen, like '$1.2345' or '299.99'. Even collapsing all those to 'UNIDENTIFIED_TOKEN' just limits the model to whatever vector it learned for 'UNIDENTIFIED_TOKEN' (if that token appeared in the training data at all).
    • I've not noticed existing word2vec implementations offering an interface for inferring the word-vector of a new unknown word from just one or several new examples of its appearance in context. (They could, in the same style as the new-document-vector inference used by 'Paragraph Vectors'/Doc2Vec, but they just don't.) The closest I've seen is Gensim's predict_output_word(), which does a CBOW-like forward-propagation on negative-sampling models, to every 'output node' (one per known word), to give a ranked list of the known words most likely to appear given some context words.

    That predict_output_word() might, if fed the surrounding known tokens, contribute to your needs by reporting whether 'VALUE_TOKEN' or 'RATE_TOKEN' is the more likely model prediction. You could adapt its code to evaluate only those two candidates, for a speed-up, if you're always sure the right answer is one or the other. A simple comparison of the average-of-context-word-vectors against the candidate-answer vectors might be as effective as the full forward-propagation.
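
    For illustration, a call along these lines (assuming the model above was trained with the default negative sampling, and that the context tokens shown are actually in the model's vocabulary) would let you compare the two candidates:

        # Known tokens surrounding the unknown numeric token in the new string
        context = ['EUR', 'RATE']
        predictions = dict(model.predict_output_word(context, topn=20))

        # Compare how strongly the model predicts each special token in this context
        print(predictions.get('VALUE_TOKEN', 0.0))
        print(predictions.get('RATE_TOKEN', 0.0))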

    Alternatively, you might want to use the word2vec model solely as a source of features (via context words) for some other classifier, which is trained to answer VALUE or RATE; a minimal sketch appears after the list below. This other classifier's input might include things like:

    • some average of the vectors of all nearby tokens
    • the full vectors of closest neighbors
    • a one-hot encoding ('bag-of-words') of all nearby (or 'preceding' and 'following') known tokens, assuming the vocabulary of non-numerical tokens is fairly short & highly indicative
    • ?
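
    Here is that sketch, using the first feature above (averaged context vectors) and an assumed scikit-learn LogisticRegression as the other classifier; the window size and labels are illustrative, not prescriptive:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def context_features(tokens, i, model, window=3):
            # average the word2vec vectors of in-vocabulary tokens around position i
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vecs = [model.wv[t] for t in neighbours if t in model.wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

        # Build training rows from the already-preprocessed token lists ('data'),
        # using the positions of the replaced special tokens as labels.
        X, y = [], []
        for tokens in data:
            for i, tok in enumerate(tokens):
                if tok in ('VALUE_TOKEN', 'RATE_TOKEN'):
                    X.append(context_features(tokens, i, model))
                    y.append(tok)

        clf = LogisticRegression(max_iter=1000).fit(X, y)
        # At prediction time, featurize each candidate numeric token of a new,
        # tokenized string the same way and ask clf.predict(...) for its label.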

    If the data streams might include arbitrary new or corrupted tokens whose meaning might be inferrable from substrings, you could consider a FastText model as well.
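
    A minimal sketch of that, assuming the same preprocessed training lists and similar parameters as the Word2Vec model above:

        from gensim.models import FastText

        # FastText also learns character n-gram vectors, so it can compose a vector
        # for tokens never seen in training (e.g. a new transaction-ID fragment).
        ft_model = FastText(data, window=3, min_count=3)
        vec = ft_model.wv['TXNw98e7r3347']  # works even for an out-of-vocabulary token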