I am trying to create a spaCy Doc object (spacy.tokens.doc.Doc) using the make_doc() function. This is what I have done:
import spacy
nlp = spacy.load('en')
a = nlp.make_doc("Sam, Software Engineer")
print(list(a)) # [Sam, ,, Software, Engineer]
But my desired result is:
print(list(a)) # [Sam, Software Engineer]
Is there a way to create a spaCy Doc object based on delimiters (in my case, it's a comma)? Or is there a way to combine two spaCy Doc objects into one object? For example:
a = nlp.make_doc("Sam")
b = nlp.make_doc("Software Engineer")
c = Combine a and b into single Doc object c
print(list(c)) # [Sam, Software Engineer]
You can build a document using the Doc class after splitting the string on the comma:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Sam, Software Engineer"
tokens = text.split(',')
words_t = [t.strip() for t in tokens]
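# spaces[i] says whether word i is FOLLOWED by a space, so look at the start of the next piece
whitespaces_t = [nxt[:1].isspace() for nxt in tokens[1:]] + [False]
a = spacy.tokens.Doc(nlp.vocab, words=words_t, spaces=whitespaces_t)
print(list(a))
# => [Sam, Software Engineer]
The words_t = [t.strip() for t in tokens] part grabs the words, and whitespaces_t = [nxt[:1].isspace() for nxt in tokens[1:]] + [False] creates a list of boolean values, one per word, denoting whether that word is followed by whitespace (True when the next piece starts with a space, False for the last word). This is what the spaces argument of Doc expects: it describes trailing whitespace, not whitespace before a word. Using nxt[:1] rather than nxt[0] also avoids an IndexError on empty pieces such as those produced by "a,,b".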
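If you need this in more than one place, the splitting can be wrapped in a small helper. The following is only a sketch: split_on_delimiter is a name I made up (it is not part of spaCy), it assumes you want empty pieces dropped, and it uses spacy.blank("en") so that no model download is needed. The spaces list follows the Doc convention that each flag marks whether a word is followed by a space.

```python
def split_on_delimiter(text, delimiter=","):
    """Hypothetical helper: return (words, spaces) suitable for spacy.tokens.Doc.

    words  -- the stripped pieces between delimiters (empty pieces dropped)
    spaces -- one bool per word: True if the word is followed by a space
    """
    # Drop pieces that are empty or whitespace-only (e.g. from "a,,b").
    pieces = [p for p in text.split(delimiter) if p.strip()]
    if not pieces:
        return [], []
    words = [p.strip() for p in pieces]
    # A word gets a trailing space if the NEXT piece began with whitespace;
    # the last word never does.
    spaces = [nxt[:1].isspace() for nxt in pieces[1:]] + [False]
    return words, spaces


print(split_on_delimiter("Sam, Software Engineer"))
# (['Sam', 'Software Engineer'], [True, False])

try:
    import spacy
    from spacy.tokens import Doc
except ImportError:
    spacy = None  # spaCy not installed; the helper above still works

if spacy is not None:
    nlp = spacy.blank("en")  # blank pipeline: only the vocab is needed here
    words, spaces = split_on_delimiter("Sam, Software Engineer")
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    print(list(doc))
```

Note that the comma itself is discarded, so the text of the resulting Doc will not match the original string exactly. If you need to keep the original text, merging tokens on a normally tokenized Doc is probably the better route.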