pandas nlp spacy named-entity-recognition

NLP: Create spaCy Doc objects based on delimiters or combine multiple Doc objects to form a single object

I am trying to create a spaCy Doc object (spacy.tokens.doc.Doc) using the make_doc() function. This is what I have done:

import spacy
nlp = spacy.load('en_core_web_sm')

a = nlp.make_doc("Sam, Software Engineer")
print(list(a)) # [Sam, ,, Software, Engineer]

But my desired result is:

print(list(a)) # [Sam, Software Engineer]

Is there a way to create a spaCy Doc object based on delimiters (in my case, a comma)? Or is there a way to combine two spaCy Doc objects into one? For example:

a = nlp.make_doc("Sam")
b = nlp.make_doc("Software Engineer")
# c = a and b combined into a single Doc object
print(list(c)) # [Sam, Software Engineer]

Solution

You can split the string on the comma and build the document directly with the Doc class:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Sam, Software Engineer"

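tokens = text.split(',')
# Strip surrounding whitespace so each comma-separated piece becomes one token
words_t = [t.strip() for t in tokens]
# spaces[i] tells spaCy whether token i is followed by a space, so check
# whether the next piece starts with whitespace; the last token gets False
whitespaces_t = [t[:1].isspace() for t in tokens[1:]] + [False]
a = spacy.tokens.Doc(nlp.vocab, words=words_t, spaces=whitespaces_t)
print(list(a))
# => [Sam, Software Engineer]

The words_t = [t.strip() for t in tokens] part strips each comma-separated piece so that it becomes a single token, and whitespaces_t = [t[:1].isspace() for t in tokens[1:]] + [False] creates a list of boolean values, one per token, indicating whether that token is followed by a space, which is what the spaces argument of Doc expects. Note that the commas themselves are dropped, so a.text is "Sam Software Engineer" rather than the original string.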
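If the input may contain empty pieces (for example "Sam,, Software Engineer" or a trailing comma), the splitting step can be wrapped in a small helper. The following is only a sketch: split_on_delimiter and make_delimited_doc are hypothetical names, not part of spaCy, and it assumes empty pieces should simply be skipped. The spaces list follows the Doc convention that each flag says whether the corresponding word is followed by a space.

```python
def split_on_delimiter(text, delimiter=","):
    """Return (words, spaces) for delimiter-separated text.

    Hypothetical helper: each non-empty piece becomes one word, and
    spaces[i] is True when words[i] should be followed by a space.
    """
    # Skip pieces that are empty or whitespace-only (e.g. from "a,,b" or "a,")
    pieces = [p for p in text.split(delimiter) if p.strip()]
    words = [p.strip() for p in pieces]
    # A word is followed by a space if the next piece began with whitespace
    spaces = [nxt[0].isspace() for nxt in pieces[1:]]
    if pieces:
        spaces.append(False)  # nothing follows the last word
    return words, spaces


def make_delimited_doc(vocab, text, delimiter=","):
    """Build a spaCy Doc with one token per delimited piece."""
    # Imported here so split_on_delimiter can be used without spaCy installed
    from spacy.tokens import Doc

    words, spaces = split_on_delimiter(text, delimiter)
    return Doc(vocab, words=words, spaces=spaces)
```

With nlp loaded as above, make_delimited_doc(nlp.vocab, "Sam, Software Engineer") should give the same two tokens as the snippet in the answer. Keeping the string handling separate from the Doc construction makes the splitting logic easy to check on its own.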