
How to output NLTK pos_tag as a string instead of a list?


I need to run nltk.pos_tag on a large dataset and need its output in the same format as the one produced by the Stanford tagger.

For example, when I run the following code:

import nltk
text=nltk.word_tokenize("We are going out.Just you and me.")
print(nltk.pos_tag(text))

the output is: [('We', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('out.Just', 'IN'), ('you', 'PRP'), ('and', 'CC'), ('me', 'PRP'), ('.', '.')]

whereas I need it to look like this:

 We/PRP are/VBP going/VBG out.Just/IN you/PRP and/CC me/PRP ./.

I would prefer not to use string functions and need direct output, because the volume of text is very large and string operations would add a lot of processing time.


Solution

    In short:

    ' '.join([word + '/' + pos for word, pos in tagged_sent])
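
    As a sketch of how the one-liner applies to the question's example, the snippet below wraps it in a helper. The name format_tagged is hypothetical (it is not part of NLTK), and the tagged tokens are copied from the question's output rather than produced by running the tagger.

```python
# Tagged tokens copied from the question's output; no tagger is run here.
tagged_sent = [('We', 'PRP'), ('are', 'VBP'), ('going', 'VBG'),
               ('out.Just', 'IN'), ('you', 'PRP'), ('and', 'CC'),
               ('me', 'PRP'), ('.', '.')]


def format_tagged(tagged_tokens, sep='/'):
    """Join (word, tag) pairs into a 'word/TAG word/TAG ...' string."""
    return ' '.join(word + sep + tag for word, tag in tagged_tokens)


tagged_str = format_tagged(tagged_sent)
print(tagged_str)
```

    The join builds the output in a single pass over the tokens, so the cost grows linearly with the amount of text.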
    

    In long:

    I think you're overthinking the cost of using string functions to concatenate the strings; it's really not that expensive.

    import time
    from nltk.corpus import brown
    
    tagged_corpus = brown.tagged_sents()
    
    start = time.time()
    
    with open('output.txt', 'w') as fout:
        for i, sent in enumerate(tagged_corpus):
            print(' '.join([word + '/' + pos for word, pos in sent]), file=fout)
    
    elapsed = time.time() - start
    print(i, elapsed)  # index of the last sentence, seconds elapsed
    

    It took 2.955 seconds on my laptop to write out the whole Brown corpus (the last sentence index printed was 57339).

    [out]:

    $ head -n1 output.txt 
    The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
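
    To get a rough feel for the cost without downloading the Brown corpus, here is a minimal sketch on synthetic data. The corpus size, the token names, and the in-memory buffer are all assumptions chosen for illustration; timings will vary by machine, so none is quoted here.

```python
import io
import time

# Synthetic stand-in for a tagged corpus: 100,000 sentences of 20 tokens each.
sentence = [('word%d' % i, 'NN') for i in range(20)]
tagged_corpus = [sentence] * 100_000

start = time.perf_counter()
buffer = io.StringIO()  # in-memory file, so disk speed is not measured
for sent in tagged_corpus:
    buffer.write(' '.join(word + '/' + pos for word, pos in sent) + '\n')
elapsed = time.perf_counter() - start

num_lines = buffer.getvalue().count('\n')
print(num_lines, 'lines written in', round(elapsed, 3), 'seconds')
```

    Swapping the StringIO buffer for a real file opened with open() would include disk I/O in the measurement, as in the Brown corpus run above.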
    

    But concatenating the word and POS tag into a single string can cause trouble later on when you need to read your tagged output back, because the word itself may contain the separator, e.g.

    >>> from nltk import pos_tag
    >>> tagged_sent = pos_tag('cat / dog'.split())
    >>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
    >>> tagged_sent_str
    'cat/NN //CD dog/NN'
    >>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
    [('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]
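
    If the string format has to be kept, one possible workaround is to split each token on its last separator only. This is a sketch, not a general fix: parse_tagged is a hypothetical helper name, and it assumes that tags never contain the separator and that words never contain whitespace.

```python
def parse_tagged(tagged_str, sep='/'):
    """Split each token on its LAST separator, so a word containing sep survives."""
    return [tuple(token.rsplit(sep, 1)) for token in tagged_str.split()]


tagged_sent_str = 'cat/NN //CD dog/NN'
pairs = parse_tagged(tagged_sent_str)
print(pairs)
```

    NLTK also provides nltk.tag.str2tuple for parsing word/TAG tokens; it is worth checking its documentation before writing your own parser.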
    

    If you want to save the tagged output and read it back later, it's better to use pickle, e.g.

    >>> import pickle
    >>> tagged_sent = pos_tag('cat / dog'.split())
    >>> with open('tagged_sent.pkl', 'wb') as fout:
    ...     pickle.dump(tagged_sent, fout)
    ... 
    >>> tagged_sent = None
    >>> tagged_sent
    >>> with open('tagged_sent.pkl', 'rb') as fin:
    ...     tagged_sent = pickle.load(fin)
    ... 
    >>> tagged_sent
    [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
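
    If a human-readable file is preferred over pickle, JSON is another option that round-trips the pairs without separator problems. Note that this swaps in a different technique from the pickle approach above, and the temporary file path is an assumption made only so the sketch is self-contained.

```python
import json
import os
import tempfile

tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

# Hypothetical location: a fresh temporary directory, to avoid clobbering files.
path = os.path.join(tempfile.mkdtemp(), 'tagged_sent.json')

with open(path, 'w', encoding='utf-8') as fout:
    json.dump(tagged_sent, fout)  # tuples are written as JSON arrays

with open(path, encoding='utf-8') as fin:
    # JSON arrays come back as lists, so convert each pair back to a tuple.
    restored = [tuple(pair) for pair in json.load(fin)]

print(restored)
```

    Unlike pickle, the resulting file can be read by tools written in other languages, at the cost of the explicit list-to-tuple conversion on load.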