Tags: command-line, pycharm, named-entity-recognition, polyglot

How can I run this Polyglot token/tag extractor in PyCharm?


I am evaluating various named entity recognition (NER) libraries, and I'm trying out Polyglot.

Everything seems to be going well, but the instructions tell me to use this line in the command prompt:

!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en ner | tail -n 20

...which should give (in the example) this output:

,               O
which           O
was             O
equalled        O
five            O
days            O
ago             O
by              O
South           I-LOC
Africa          I-LOC
in              O
their           O
victory         O
over            O
West            I-ORG
Indies          I-ORG
in              O
Sydney          I-LOC
.               O

That's exactly the kind of output I need for my project. However, I need to run it from within PyCharm rather than from the command line, and store the results in a pandas DataFrame. How do I translate that command into Python?


Solution

  • This assumes Polyglot is installed correctly and the right environment is selected in PyCharm. If not, install Polyglot with its requirements in a new conda environment, then create a new project and select that existing conda environment in PyCharm. If the language embeddings and NER models have not been downloaded yet, download them first.

    Code:

    from polyglot.text import Text
    
    blob = """, which was equalled five days ago by South Africa in the victory over West Indies in Sydney."""
    text = Text(blob)
    text.language = "en"
    
    
    ## All detected entities, as a list
    print("As list all detected entities")
    print(text.entities)
    
    print()
    
    ## Each detected entity on its own line, with its tag
    print("Separately shown detected entities")
    for entity in text.entities:
        print(entity.tag, entity)
    
    print()
    
    ## Tokenized words of sentence
    print("Tokenized words of sentence")
    print(text.words)
    
    print()
    
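    ## Run named entity recognition on each token separately.
    ## This is not very reliable: without context, the language detector may decide
    ## a single word is not English and try another language. If text.language = "en"
    ## is commented out and that language's embeddings are not installed, it may raise an error.
    print("For each token try named entity recognition")
    for word in text.words:
        # Use a separate name so the full-sentence `text` object is not overwritten.
        token_text = Text(word)
        token_text.language = "en"
    
        ## Print whatever entities were found in this single token
        for entity in token_text.entities:
            print(entity.tag, entity)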
    

    Output:

    As list all detected entities
    [I-LOC(['South', 'Africa']), I-ORG(['West', 'Indies']), I-LOC(['Sydney'])]
    
    Separately shown detected entities
    I-LOC ['South', 'Africa']
    I-ORG ['West', 'Indies']
    I-LOC ['Sydney']
    
    Tokenized words of sentence
    [',', 'which', 'was', 'equalled', 'five', 'days', 'ago', 'by', 'South', 'Africa', 'in', 'the', 'victory', 'over', 'West', 'Indies', 'in', 'Sydney', '.']
    
    For each token try named entity recognition
    I-LOC ['Africa']
    I-PER ['Sydney']
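
The listings above do not yet produce the one-row-per-token table the question asks for. Below is a minimal sketch of one way to get there. It assumes that each entity in `text.entities` exposes `tag`, `start` and `end` attributes, where `start` and `end` are word indices into `text.words` with `end` exclusive. That is my understanding of Polyglot's chunk objects, but it is not shown in the output above, so check it against your installed version. The helper names `tag_tokens` and `polyglot_to_dataframe` and the column names are hypothetical.

```python
def tag_tokens(words, spans):
    """Pair every token with an entity tag; tokens outside all spans get 'O'.

    words: list of token strings.
    spans: iterable of (tag, start, end), end exclusive, indexing into words.
    """
    tags = ["O"] * len(words)
    for tag, start, end in spans:
        for i in range(start, end):
            tags[i] = tag
    return list(zip(words, tags))


def polyglot_to_dataframe(blob, lang="en"):
    """Hypothetical helper: run Polyglot NER on blob, return a token/tag DataFrame."""
    # Imported here so tag_tokens stays usable without polyglot or pandas installed.
    import pandas as pd
    from polyglot.text import Text

    text = Text(blob)
    text.language = lang  # same language override as in the answer's code
    # Assumption: each entity chunk carries .tag, .start and .end.
    spans = [(entity.tag, entity.start, entity.end) for entity in text.entities]
    rows = tag_tokens([str(word) for word in text.words], spans)
    return pd.DataFrame(rows, columns=["token", "tag"])


if __name__ == "__main__":
    # Tokens and entity positions copied from the output above,
    # so this demo runs without any Polyglot models.
    words = [",", "which", "was", "equalled", "five", "days", "ago", "by",
             "South", "Africa", "in", "the", "victory", "over",
             "West", "Indies", "in", "Sydney", "."]
    spans = [("I-LOC", 8, 10), ("I-ORG", 14, 16), ("I-LOC", 17, 18)]
    for token, tag in tag_tokens(words, spans):
        print(f"{token:<16}{tag}")
```

In PyCharm, `df = polyglot_to_dataframe(blob)` should then give a DataFrame with `token` and `tag` columns, provided the assumption about `start` and `end` holds. Tagging the whole sentence once and spreading the tags over the tokens avoids the per-token language detection problems noted in the last loop of the answer.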