Search code examples
nlpspacy

How to reconstruct original text from spaCy tokens, even in cases with complicated whitespacing and punctuation


' '.join(token_list) does not reconstruct the original text in cases with multiple whitespaces and punctuation in a row.

For example:

from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizerSpaCy = Tokenizer(nlp.vocab)

context_text = 'this    is a     test \n \n \t\t test for    \n testing  -  ./l \t'

contextSpaCyToksSpaCyObj = tokenizerSpaCy(context_text)
spaCy_toks = [i.text for i in contextSpaCyToksSpaCyObj]

reconstruct = ' '.join(spaCy_toks)
reconstruct == context_text

>False 

Is there an established way of reconstructing original text from spaCy tokens?

Established answer should work with this edge case text (you can directly get the source from clicking the 'improve this question' button)

" UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\n\n RELEASE IN PART\n B5, B6\n\n\n\n\nFrom: H <[email protected]>\nSent: Monday, July 23, 2012 7:26 AM\nTo: 'millscd @state.gov'\nCc: '[email protected].; '[email protected]'\nSubject Re: S speech this morning\n\n\n\n Waiting to hear if Monica can come by and pick up at 8 to take to Josh. If I don't hear from her, can you send B5\nsomeone else?\n\n Original Message ----\nFrom: Mills, Cheryl D [[email protected]]\nSent: Monday, July 23, 2012 07:23 AM\nTo: H\nCc: Daniel, Joshua J <[email protected]>\nSubject: FW: S speech this morning\n\nSee below\n\n B5\n\ncdm\n\n Original Message\nFrom: Shah, Rajiv (AID/A) B6\nSent: Monday, July 23, 2012 7:19 AM\nTo: Mills, Cheryl D\nCc: Daniel, Joshua.'\nSubject: S speech this morning\n\nHi cheryl,\n\nI look fwd to attending the speech this morning.\n\nI had one last minute request - I understand that in the final version there is no reference to the child survival call to\naction, but their is a reference to family planning efforts. Could you and josh try to make sure there is some specific\nreference to the call to action?\n\nAlso, in terms of acknowledgements it would be good to note torn friedan's leadership as everyone is sensitive to our ghi\ntransition and we want to continue to send the usaid-pepfar-cdc working together public message. I don't know if he is\nthere, but wanted to flag.\n\nLook forward to it.\n\nRaj\n\n\n\n\n UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\x0c"


Solution

  • You can very easily accomplish this by changing two lines in your code:

    spaCy_toks = [i.text + i.whitespace_ for i in contextSpaCyToksSpaCyObj]
    reconstruct = ''.join(spaCy_toks)
    

    Basically, each token in spaCy knows whether it is followed by whitespace or not. So you call token.whitespace_ instead of joining them on space by default.