Tags: python, tokenize, word-embedding, doc2vec

Keeping Numbers in Doc2Vec Tokenization


I’m in the process of trying to get document similarity values for a corpus of approximately 5,000 legal briefs with Doc2Vec (I recognize that the corpus may be a little bit small, but this is a proof-of-concept project for a larger corpus of approximately 15,000 briefs I’ll have to compile later).

Basically, every other component in the creation of the model is going relatively well so far: each brief is in its own text file within a larger folder, so I compiled them in my script using glob.glob. But I’m running into a tokenization problem. Because these documents are legal briefs, they contain numbers that I’d like to keep, yet many of the guides I’ve been following use Gensim’s simple_preprocess in tandem with the TaggedDocument feature, and I believe simple_preprocess eliminates digits from the corpus. I want to do as little preprocessing on the texts as possible.

Below is the code I’ve used. I’ve tried swapping simple_preprocess for gensim.utils.tokenize, but when I do that, I get generator objects that don’t appear workable in my final Doc2Vec model, and I can’t actually see how the corpus looks. When I’ve tried other tokenizers, like NLTK’s, I don’t know how to fit them into the TaggedDocument component.

import codecs
import gensim

# brief_filenames was compiled earlier with glob.glob
brief_corpus = []
for brief_filename in brief_filenames:
    with codecs.open(brief_filename, "r", "utf-8") as brief_file:
        brief_corpus.append(
            gensim.models.doc2vec.TaggedDocument(
                gensim.utils.simple_preprocess(brief_file.read()),
                [brief_filename]))  # tagging each brief with its filename

I’d appreciate any advice on how to combine a tokenizer that just separates on whitespace and doesn’t eliminate any numbers with the TaggedDocument feature. Thank you!

Update: I was able to write some rudimentary code for basic tokenization (I do plan on refining it further) without having to resort to Gensim's simple_preprocess function. However, I'm having difficulty (again!) with the TaggedDocument feature: this time, the tags (which I want to be the file names of each brief) don't match the tokenized documents. Basically, each document has a tag, but it's not the right one.

Can anyone possibly advise where I might have gone wrong with the new code below? Thanks!

import os
import re
from gensim.models.doc2vec import TaggedDocument

briefs = []
BriefList = [p for p in os.listdir(FILEPATH) if p.endswith('.txt')]
for brief in BriefList:
    text = open(FILEPATH + brief, 'r').read()
    tokens = re.findall(r"[\w']+|[.,!?;]", text)
    tagged_data = [TaggedDocument(tokens, [brief]) for brief in BriefList]
    briefs.append(tagged_data)

Solution

  • You're likely going to want to write your own preprocessing/tokenization functions. But don't worry, it's not hard to outdo Gensim's simple_preprocess, even with very crude code.

    The only thing Doc2Vec needs as the words of a TaggedDocument is a list of string tokens (typically words).

    So first, you might be surprised how well it works to just do a default Python string .split() on your raw strings - which just breaks text on whitespace.
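
    For instance, here's a minimal sketch of building the tagged corpus that way, reusing the brief_filenames list you already gathered with glob.glob:

    import gensim

    brief_corpus = []
    for brief_filename in brief_filenames:  # the glob.glob list from your question
        with open(brief_filename, "r", encoding="utf-8") as brief_file:
            tokens = brief_file.read().split()  # whitespace-only tokenization; digits are kept intact
            brief_corpus.append(
                gensim.models.doc2vec.TaggedDocument(tokens, [brief_filename]))

    Each TaggedDocument here pairs one brief's own tokens with its own filename tag, and every number stays exactly as it appears in the text.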

    Sure, a bunch of the resulting tokens will then be mixes of words & adjoining punctuation, which may be nearly nonsense.

    For example, the word 'lawsuit' at the end of a sentence might appear as 'lawsuit.', which then won't be recognized as the same token as 'lawsuit', and might not appear at least min_count times to even be considered, or otherwise barely rise above serving as noise.

    But especially for longer documents and larger datasets, no one token, or even 1% of all tokens, has that much influence. This isn't exact-keyword-search, where failing to return a document with 'lawsuit.' for a query on 'lawsuit' would be a fatal failure. A bunch of words 'lost' to such cruft may have hardly any effect on overall document, or model, performance.

    As your datasets seem manageable enough to run lots of experiments, I'd suggest trying this dumbest-possible tokenization – only .split() – just as a baseline to become confident that the algorithm still mostly works as well as some more intrusive operation (like simple_preprocess()).

    Then, as you notice, or suspect, or ideally measure with some repeatable evaluation, that some things you'd want to be meaningful tokens aren't treated right, gradually add extra steps of stripping/splitting/canonicalizing characters or tokens. But as much as possible, check that the extra complexity of code, and runtime, is actually delivering benefits.

    For example, further refinements could be some mix of the following (a rough sketch in code follows this list):

    • For each token created by the simple split(), strip off any non-alphanumeric leading/trailing chars. (Advantages: eliminates that punctuation-fouling-words cruft. Disadvantages: might lose useful symbols, like the leading $ of monetary amounts.)
    • Before splitting, replace certain single-character punctuation-marks (like say ['.', '"', ',', '(', ')', '!', '?', ';', ':']) with the same character with spaces on both sides - so that they're never connected with nearby words, and instead survive a simple .split() as standalone tokens. (Advantages: also prevents words-plus-punctuation cruft. Disadvantages: breaks up numbers like 2,345.77 or some useful abbreviations.)
    • At some appropriate stage in tokenization, canonicalize many varied tokens into a smaller set of tokens that may be more meaningful than each of them as rare standalone tokens. For example, $0.01 through $0.99 might all be turned into $0_XX - which then has a better chance of influencing the model, & being associated with 'tiny amount' concepts, than the original standalone tokens. Or replace all digits with #, so that numbers of similar magnitudes share influence, without diluting the model with a token for every single number.
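
    As a rough sketch in plain Python: the helper names, the exact punctuation list, and the digit-to-'#' rule below are just illustrative choices (matching the examples above), not a fixed recipe:

    import re

    PUNCT_TO_PAD = ['.', '"', ',', '(', ')', '!', '?', ';', ':']  # the illustrative marks from above

    def strip_outer_cruft(tokens):
        # refinement 1: strip non-alphanumeric leading/trailing characters from each token
        stripped = [re.sub(r"^\W+|\W+$", "", token) for token in tokens]
        return [token for token in stripped if token]  # drop tokens that were pure punctuation

    def pad_punctuation(text):
        # refinement 2: put spaces around selected punctuation so each mark
        # survives a plain .split() as its own standalone token
        for mark in PUNCT_TO_PAD:
            text = text.replace(mark, " " + mark + " ")
        return text

    def canonicalize_digits(tokens):
        # refinement 3 (one possible scheme): replace every digit with '#',
        # so numbers of similar magnitude collapse into shared tokens
        return [re.sub(r"\d", "#", token) for token in tokens]

    # one possible mix, applied to a single brief's raw text:
    # tokens = canonicalize_digits(pad_punctuation(raw_text).split())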

    The exact mix of heuristics, and order of operations, will depend on your goals. But with a corpus only in the thousands of docs (rather than hundreds-of-thousands or millions), even if you do these replacements in a fairly inefficient way (lots of individual string- or regex- replacements in serial), it'll likely be a manageable preprocessing cost.

    But you can start simple & only add complexity that your domain-specific knowledge, and evaluations, justify.