Tags: python, text, data-mining, keyword, analysis

Remove duplicates from a tuple


I am trying to extract keywords from a text. Using the "en_core_sci_lg" model, I get a tuple of phrases/words that contains some duplicates, which I want to remove. I tried deduplication functions for lists and tuples, but none of them worked. Can anyone help? I would really appreciate it.

text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The MIT library is published under the MIT license and its main developers are Matthew Honnibal and Ines Honnibal, the founders of the software company Explosion."""

One version of the code I have tried:

import spacy
nlp = spacy.load("en_core_sci_lg")

doc = nlp(text)
my_tuple = list(set(doc.ents))
print('original tuple:', doc.ents, len(doc.ents))
print('after set function:', my_tuple, len(my_tuple))

The output:

original tuple: (spaCy, open-source software library, programming languages, Python, Cython, MIT, library, published, MIT, license, developers, Matthew Honnibal, Ines, Honnibal, founders, software company Explosion) 16

after set function: [Honnibal, MIT, Ines, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, MIT, published, open-source software library, spaCy] 16

The desired output is (there should be only one MIT, and the name Ines Honnibal should be kept together):

[Ines Honnibal, MIT, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, published, open-source software library, spaCy]

Solution

  • doc.ents is not a collection of strings. It is a tuple of Span objects. When you print a Span, it prints its text, but each one is a separate object tied to its own position in the document, which is why set does not treat the two MIT spans as duplicates. The clue is that there are no quote marks in your printed output. If those were strings, you would see quotation marks.

    To remove the duplicates, compare the text of each span rather than the Span objects themselves:

    my_tuple = list(set(e.text for e in doc.ents))
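Note that a set does not keep the entities in their original order. If order matters, a minimal sketch like the one below may help. It runs without spaCy: `FakeSpan` is a hypothetical stand-in (not part of spaCy) that only mimics the relevant behaviour, namely that two spans with the same text at different positions are not equal. `unique_texts` is likewise an illustrative helper name, not a library function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span: equality depends on position, not only on text."""
    text: str
    start: int


def unique_texts(spans):
    """Return each span's text once, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(span.text for span in spans))


# Two "MIT" spans at different positions, similar to the question's output
ents = (
    FakeSpan("MIT", 5),
    FakeSpan("library", 6),
    FakeSpan("MIT", 9),
    FakeSpan("license", 10),
)

print(len(set(ents)))      # 4: the two MIT spans count as different objects
print(unique_texts(ents))  # ['MIT', 'library', 'license']
```

With a real document, the same helper should work as `unique_texts(doc.ents)`, since it only reads the `.text` attribute. It will not join "Ines" and "Honnibal" into one entry: that split comes from the entity boundaries the model predicted, not from the deduplication step.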