I am evaluating various named entity recognition (NER) libraries, and I'm trying out Polyglot.
Everything seems to be going well, but the instructions tell me to use this line in the command prompt:
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en ner | tail -n 20
...which should give (in the example) this output:
, O
which O
was O
equalled O
five O
days O
ago O
by O
South I-LOC
Africa I-LOC
in O
their O
victory O
over O
West I-ORG
Indies I-ORG
in O
Sydney I-LOC
. O
That's exactly the kind of output I need for my project, and it works exactly like I need it to work; however, I need to run that within my PyCharm interface, not the command line, and store the results in a pandas dataframe. How do I translate that command?
This assumes polyglot is installed correctly and the proper environment is selected in PyCharm. If not, install polyglot in a new conda environment
with the necessary requirements, then create a new project and select that existing conda environment in PyCharm. If the language embeddings
and NER models have not been downloaded yet, download them first.
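If it helps, the models can also be fetched from Python instead of the shell. This is a rough sketch, assuming the package names follow polyglot's `embeddings2.<lang>` and `ner2.<lang>` pattern; `model_names` and `download_models` are hypothetical helper names, not part of polyglot.

```python
def model_names(lang="en"):
    """Package names for the embeddings and NER model of one language.

    Assumption: polyglot names them embeddings2.<lang> and ner2.<lang>.
    """
    return ["embeddings2." + lang, "ner2." + lang]


def download_models(lang="en"):
    """Download the models; needs polyglot installed and network access."""
    # Imported lazily so model_names works without polyglot installed.
    from polyglot.downloader import downloader

    for name in model_names(lang):
        downloader.download(name)
```

Calling `download_models("en")` once per environment should be enough.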
Code:
from polyglot.text import Text
blob = """, which was equalled five days ago by South Africa in the victory over West Indies in Sydney."""
text = Text(blob)
text.language = "en"
## All detected entities as a list
print("As list all detected entities")
print(text.entities)
print()
## Each detected entity shown separately
print("Separately shown detected entities")
for entity in text.entities:
    print(entity.tag, entity)
print()
## Tokenized words of sentence
print("Tokenized words of sentence")
print(text.words)
print()
## Run named entity recognition on each token separately.
## This is not very reliable: on a single word, language detection may guess a language other than English.
## If embeddings for that language are not installed, or the line setting language = "en" is commented out, it may raise an error.
print("For each token try named entity recognition")
for word in text.words:
    # Use a separate name so the full-sentence `text` object is not overwritten.
    word_text = Text(word)
    word_text.language = "en"
    ## Separately
    for entity in word_text.entities:
        print(entity.tag, entity)
Output:
As list all detected entities
[I-LOC(['South', 'Africa']), I-ORG(['West', 'Indies']), I-LOC(['Sydney'])]
Separately shown detected entities
I-LOC ['South', 'Africa']
I-ORG ['West', 'Indies']
I-LOC ['Sydney']
Tokenized words of sentence
[',', 'which', 'was', 'equalled', 'five', 'days', 'ago', 'by', 'South', 'Africa', 'in', 'the', 'victory', 'over', 'West', 'Indies', 'in', 'Sydney', '.']
For each token try named entity recognition
I-LOC ['Africa']
I-PER ['Sydney']
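To get the one-token-per-row output from the question into a pandas dataframe, one option is to tag every token with the entity that covers it and `O` otherwise. This is a minimal sketch, assuming each entity's `start` and `end` attributes index into `text.words` with `end` exclusive; `tag_tokens` and `ner_dataframe` are hypothetical helper names.

```python
def tag_tokens(words, spans, outside="O"):
    """Return (token, tag) rows; spans are (start, end, tag), end exclusive."""
    tags = [outside] * len(words)
    for start, end, tag in spans:
        for i in range(start, end):
            tags[i] = tag
    return list(zip(words, tags))


def ner_dataframe(blob, lang="en"):
    """Run polyglot NER on `blob` and return a token/tag DataFrame."""
    # Imported lazily so tag_tokens can be used without these packages.
    import pandas as pd
    from polyglot.text import Text

    text = Text(blob)
    text.language = lang
    # Assumption: entity.start / entity.end are positions in text.words.
    spans = [(e.start, e.end, e.tag) for e in text.entities]
    rows = tag_tokens([str(w) for w in text.words], spans)
    return pd.DataFrame(rows, columns=["token", "tag"])
```

With the models installed, `df = ner_dataframe(blob)` should give one row per token with its tag in the second column, which avoids the unreliable word-by-word approach.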