
Split text into tokens on different rows in a dataframe


I am new to this, but I am trying to split the text in a pandas DataFrame into individual rows, one row per token, together with each token's POS and tag. For example:

            Text
   1        Police officers arrest teen.
   2        Man agrees to help.

What I am trying to achieve is:

Sentence#  Token     POS   Tag
   1       Police    NNS   B-NP
           officers  NNS   I-NP
           arrest    VBP   B-VP
           teen      NN    B-NP
   2       Man       NNP   B-NP
           agrees    VBZ   B-VP
           to        TO    B-VP
           help      VB    B-VP

Solution

  • The nltk module can do this. The code below uses nltk to build a new DataFrame that is close to your desired output. To get tags that match your desired output exactly, you will likely need to supply your own chunk parser: the grammar used here only chunks noun phrases, so every other token is tagged O. I am no expert in POS and IOB tagging.

    import pandas as pd
    from nltk import word_tokenize, pos_tag, tree2conlltags, RegexpParser
    
    # orig data
    d = {'Text': ["Police officers arrest teen.", "Man agrees to help."]}
    # orig DataFrame
    df = pd.DataFrame(data = d)
    
    # new data
    new_d = {'Sentence': [], 'Token': [], 'POS': [], 'Tag': []}
    
    # grammar taken from nltk.org
    grammar = r"NP: {<[CDJNP].*>+}"
    parser = RegexpParser(grammar)
    
    for idx, row in df.iterrows():
        temp = tree2conlltags(parser.parse(pos_tag(word_tokenize(row["Text"]))))
        new_d['Token'].extend(i[0] for i in temp)
        new_d['POS'].extend(i[1] for i in temp)
        new_d['Tag'].extend(i[2] for i in temp)
        new_d['Sentence'].extend([idx + 1] * len(temp))
    
    # new DataFrame
    new_df = pd.DataFrame(data = new_d)
    
    print(f"***Original DataFrame***\n\n {df}\n")
    print(f"***New DataFrame***\n\n {new_df}")
    

    Output:

    ***Original DataFrame***
    
                                Text
    0  Police officers arrest teen.
    1           Man agrees to help.
    
    ***New DataFrame***
    
        Sentence     Token  POS   Tag
    0         1    Police  NNP  B-NP
    1         1  officers  NNS  I-NP
    2         1    arrest  VBP     O
    3         1      teen   NN  B-NP
    4         1         .    .     O
    5         2       Man   NN  B-NP
    6         2    agrees  VBZ     O
    7         2        to   TO     O
    8         2      help   VB     O
    9         2         .    .     O
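
The output above still differs from the layout in the question in two ways: it keeps the punctuation rows, and it repeats the sentence number on every row. The following is a minimal sketch of one way to close that gap, not part of the original answer. It works on the (token, POS, tag) triples that tree2conlltags returns, so it runs without any NLTK data; the helper name `to_rows` and the `drop_punct` parameter are made up for illustration, and the sample triples are copied from the output above.

```python
import string


def to_rows(tagged_sentences, drop_punct=True):
    """Flatten per-sentence (token, pos, iob) triples into row dicts.

    The sentence number appears only on the first kept row of each
    sentence; later rows get an empty string, as in the question.
    """
    rows = []
    for number, triples in enumerate(tagged_sentences, start=1):
        first = True
        for token, pos, iob in triples:
            # Skip tokens made up entirely of punctuation characters.
            if drop_punct and all(ch in string.punctuation for ch in token):
                continue
            rows.append({
                "Sentence": number if first else "",
                "Token": token,
                "POS": pos,
                "Tag": iob,
            })
            first = False
    return rows


# Sample triples copied from the output above (hypothetical input variable).
tagged = [
    [("Police", "NNP", "B-NP"), ("officers", "NNS", "I-NP"),
     ("arrest", "VBP", "O"), ("teen", "NN", "B-NP"), (".", ".", "O")],
    [("Man", "NN", "B-NP"), ("agrees", "VBZ", "O"),
     ("to", "TO", "O"), ("help", "VB", "O"), (".", ".", "O")],
]

rows = to_rows(tagged)
for row in rows:
    print(f"{row['Sentence']!s:<9}{row['Token']:<10}{row['POS']:<5}{row['Tag']}")
```

Passing `rows` to `pd.DataFrame(rows)` should give a frame with the same four columns. Blanking the repeated sentence numbers is only cosmetic and makes the column harder to group on, so it is probably best done just before display.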
    

    Note: after installing nltk with pip, you will likely have to call nltk.download a few times before the code above can run. The error message you get should tell you what to execute, and the resource names it asks for can differ between NLTK versions. For example, you will likely need to run:

    >>> import nltk
    >>> nltk.download('punkt')
    >>> nltk.download('averaged_perceptron_tagger')
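
If you would rather not trigger a download on every run, the check can be wrapped in a small helper. This is only a sketch under assumptions: `ensure_resources` is a hypothetical name, the lookup and download functions are passed in as parameters so the logic can be tried without NLTK installed, and the resource paths in `RESOURCES` are assumed locations for the two packages named above (newer NLTK releases may ask for differently named resources).

```python
def ensure_resources(resources, find, download):
    """Download each package only when `find` cannot locate its path.

    `resources` is a list of (path, package) pairs. `find` must raise
    LookupError for a missing path; `download` takes a package name.
    Returns the list of packages that were downloaded.
    """
    fetched = []
    for path, package in resources:
        try:
            find(path)
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched


# Package names come from the calls above; the paths are assumptions.
RESOURCES = [
    ("tokenizers/punkt", "punkt"),
    ("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger"),
]

# With NLTK installed, the call would look like this:
#   import nltk
#   ensure_resources(RESOURCES, nltk.data.find, nltk.download)
```

The error message remains the authority on which resource is missing; if it names something other than the two packages listed here, add that pair to `RESOURCES` instead.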