
Pattern table to Pandas DataFrame


I am working with the Python "pattern.en" package, which gives me the subject, object and other details about a given sentence.

I want to store this output in another variable or a DataFrame for further processing, but I have not been able to do so.

Any input on this would be helpful.

Sample code is included below for reference.

from pattern.en import parse
from pattern.en import pprint
import pandas as pd

input = parse('I want to go to the Restaurant as I am hungry very much')
print(input)    
I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O to/TO/O/O the/DT/B-NP/O Restaurant/NNP/I-NP/O as/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP am/VBP/B-VP/O hungry/JJ/B-ADJP/O very/RB/I-ADJP/O much/JJ/I-ADJP/O

pprint(input)

      WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA                                                
         I   PRP    NP       -      -      -      -       
      want   VBP    VP       -      -      -      -       
        to   TO     VP ^     -      -      -      -       
        go   VB     VP ^     -      -      -      -       
        to   TO     -        -      -      -      -       
       the   DT     NP       -      -      -      -       
Restaurant   NNP    NP ^     -      -      -      -       
        as   IN     PP       -      -      PNP    -       
         I   PRP    NP       -      -      PNP    -       
        am   VBP    VP       -      -      -      -       
    hungry   JJ     ADJP     -      -      -      -       
      very   RB     ADJP ^   -      -      -      -       
      much   JJ     ADJP ^   -      -      -      -       

Please note the output of both the print and pprint statements. I am trying to store either one of them in a variable. Ideally I would store the pprint output in a DataFrame, since it is already printed in tabular format.

But when I try to do so, I get the error below:

df = pd.DataFrame(input)

ValueError: DataFrame constructor not properly called!


Solution

  • Taking the source of the table function as a starting point, I came up with this:

    from pattern.en import parse
    from pattern.text.tree import WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA, IOB, ROLE, MBSP, Text
    import pandas as pd
    
    def sentence2df(sentence, placeholder="-"):
        tags  = [WORD, POS, IOB, CHUNK, ROLE, REL, PNP, ANCHOR, LEMMA]
        tags += [tag for tag in sentence.token if tag not in tags]
        def format(token, tag):
            # Returns the token tag as a string.
            if   tag == WORD   : s = token.string
            elif tag == POS    : s = token.type
            elif tag == IOB    : s = token.chunk and (token.index == token.chunk.start and "B" or "I")
            elif tag == CHUNK  : s = token.chunk and token.chunk.type
            elif tag == ROLE   : s = token.chunk and token.chunk.role
            elif tag == REL    : s = token.chunk and token.chunk.relation and str(token.chunk.relation)
            elif tag == PNP    : s = token.chunk and token.chunk.pnp and token.chunk.pnp.type
            elif tag == ANCHOR : s = token.chunk and token.chunk.anchor_id
            elif tag == LEMMA  : s = token.lemma
            else               : s = token.custom_tags.get(tag)
            return s or placeholder
    
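        # One list per tag, each holding that tag's value for every token.
        columns = [[format(token, tag) for token in sentence] for tag in tags]
        # Mark chunk continuations (IOB == "I") with " ^", then drop the IOB column.
        columns[3] = [columns[3][i]+(iob == "I" and " ^" or "") for i, iob in enumerate(columns[2])]
        del columns[2]
        header = ['word', 'tag', 'chunk', 'role', 'id', 'pnp', 'anchor', 'lemma']+tags[9:]
    
        # The anchor column is only kept when MBSP is set.
        if not MBSP:
            del columns[6]
            del header[6]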
    
        return pd.DataFrame(
            [[x[i] for x in columns] for i in range(len(columns[0]))],
            columns=header,
        )
    

    Usage

    >>> string = parse('I want to go to the Restaurant as I am hungry very much')
    >>> sentence = Text(string, token=[WORD, POS, CHUNK, PNP])[0]
    >>> df = sentence2df(sentence)
    >>> print(df)
              word  tag   chunk role id  pnp lemma
    0            I  PRP      NP    -  -    -     -
    1         want  VBP      VP    -  -    -     -
    2           to   TO    VP ^    -  -    -     -
    3           go   VB    VP ^    -  -    -     -
    4           to   TO       -    -  -    -     -
    5          the   DT      NP    -  -    -     -
    6   Restaurant  NNP    NP ^    -  -    -     -
    7           as   IN      PP    -  -  PNP     -
    8            I  PRP      NP    -  -  PNP     -
    9           am  VBP      VP    -  -    -     -
    10      hungry   JJ    ADJP    -  -    -     -
    11        very   RB  ADJP ^    -  -    -     -
    12        much   JJ  ADJP ^    -  -    -     -
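
A note on the error in the question: parse returns a tagged string, and pd.DataFrame rejects a bare string, which is what produces "DataFrame constructor not properly called!". If the raw tags shown by print are enough, a lighter alternative is to split that string directly instead of going through the tree classes. The following is only a sketch under some assumptions: the function name tagged_string_to_df, the column names and the extra "sentence" column are made up for illustration, each token is assumed to have the word/tag/chunk/pnp layout shown in the question's print output, and no word is assumed to contain a literal "/".

```python
import pandas as pd

def tagged_string_to_df(tagged, columns=("word", "tag", "chunk", "pnp")):
    """Split a slash-tagged string into one DataFrame row per token.

    Hypothetical helper: assumes space-separated tokens, one sentence
    per line, and slash-separated tags in the order given by `columns`.
    """
    rows = []
    for sentence_id, line in enumerate(str(tagged).splitlines()):
        for token in line.split():
            parts = token.split("/")
            # Pad short tokens with "O" and drop extra fields so every
            # row has exactly len(columns) values.
            parts = (parts + ["O"] * len(columns))[:len(columns)]
            rows.append([sentence_id] + parts)
    return pd.DataFrame(rows, columns=["sentence"] + list(columns))

# The string printed by print(input) in the question, copied verbatim.
tagged = (
    "I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O to/TO/O/O "
    "the/DT/B-NP/O Restaurant/NNP/I-NP/O as/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP "
    "am/VBP/B-VP/O hungry/JJ/B-ADJP/O very/RB/I-ADJP/O much/JJ/I-ADJP/O"
)

df = tagged_string_to_df(tagged)
print(df)
```

With pattern installed, the same call should work on the result of parse(...) directly, since it is converted with str() first. Unlike sentence2df above, this keeps the raw IOB prefixes (B-NP, I-NP) rather than the "NP ^" notation that pprint displays.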