pandas, nlp, spacy, named-entity-recognition

NLP: Create spaCy Doc objects based on delimiters or combine multiple Doc objects to form a single object


I am trying to create a spaCy Doc object (spacy.tokens.doc.Doc) using the make_doc() function. This is what I have done:

import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy v3

a = nlp.make_doc("Sam, Software Engineer")
print(list(a)) # [Sam, ,, Software, Engineer]

But my desired result is:

print(list(a)) # [Sam, Software Engineer]

Is there a way to create a spaCy Doc object based on delimiters (in my case, a comma)? Or is there a way to combine two spaCy Doc objects into one object? For example:

a = nlp.make_doc("Sam")
b = nlp.make_doc("Software Engineer")
# c = a and b combined into a single Doc object
print(list(c)) # [Sam, Software Engineer]

Solution

  • You can build the document directly with the Doc class after splitting the string on the comma:

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    text = "Sam, Software Engineer"
    
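    # Split on the delimiter and trim surrounding whitespace from each piece
    tokens = text.split(',')
    words_t = [t.strip() for t in tokens]
    # spaces[i] marks whether word i is FOLLOWED by a space: all but the last
    whitespaces_t = [i < len(words_t) - 1 for i in range(len(words_t))]
    a = spacy.tokens.Doc(nlp.vocab, words=words_t, spaces=whitespaces_t)
    print(list(a))
    # => [Sam, Software Engineer]

    The words_t = [t.strip() for t in tokens] part extracts the words with surrounding whitespace removed. The spaces argument of Doc expects one boolean per word saying whether that word is followed by a space (not preceded by one), so whitespaces_t is True for every word except the last. The resulting a.text is "Sam Software Engineer"; the comma itself is not preserved in the Doc.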
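The question also asks about combining two existing Doc objects. A minimal sketch of both cases follows. The helper names doc_from_delimited and combine_as_tokens are hypothetical (they are not part of spaCy), and the sketch assumes a blank English pipeline is acceptable, since only the shared vocab is needed to construct a Doc by hand.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank pipeline is enough, because only nlp.vocab is used.
nlp = spacy.blank("en")


def doc_from_delimited(vocab, text, delimiter=","):
    """Hypothetical helper: build a Doc with one token per delimited chunk."""
    words = [chunk.strip() for chunk in text.split(delimiter)]
    words = [w for w in words if w]  # skip empty chunks, e.g. from "a,, b"
    # spaces[i] is True when token i is followed by a space (all but the last).
    spaces = [i < len(words) - 1 for i in range(len(words))]
    return Doc(vocab, words=words, spaces=spaces)


def combine_as_tokens(docs):
    """Hypothetical helper: turn each input Doc into a single token."""
    words = [d.text.strip() for d in docs]
    spaces = [i < len(words) - 1 for i in range(len(words))]
    # All input Docs are assumed to share the same vocab.
    return Doc(docs[0].vocab, words=words, spaces=spaces)


c = doc_from_delimited(nlp.vocab, "Sam, Software Engineer")
print([t.text for t in c])  # ['Sam', 'Software Engineer']
print(c.text)               # Sam Software Engineer

a = nlp.make_doc("Sam")
b = nlp.make_doc("Software Engineer")
d = combine_as_tokens([a, b])
print([t.text for t in d])  # ['Sam', 'Software Engineer']
```

Note that spaCy v3 also provides Doc.from_docs for concatenating Docs, but as far as I know it keeps each Doc's existing tokenization, so it would give [Sam, Software, Engineer] rather than the two-token result wanted here.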