Tags: python, python-3.7, python-pattern

How to convert a Text object from the parsetree output of the Pattern module to a string in Python?


I have a list of words like this:

['Urgente', 'Recibimos', 'Info']

I used the parsetree function (parsetree(x, lemmata=True)) to parse each word, and the output for each word is this:

[[Sentence('urgente/JJ/B-ADJP/O/urgente')],
[Sentence('recibimos/NN/B-NP/O/recibimos')],
[Sentence('info/NN/B-NP/O/info')]]

Each component of the list has the type pattern.text.tree.Text.

I need to obtain only the text inside the parentheses, but I don't know how to do this. I need this output:

[urgente/JJ/B-ADJP/O/urgente,
recibimos/NN/B-NP/O/recibimos,
info/NN/B-NP/O/info]

I tried using str to convert each component of the list to a string, but this changes the whole output.


Solution

  • From the Pattern documentation, there doesn't seem to be a direct method or property to get what you want.

    But I found that a Sentence object can be printed as Sentence('urgente/JJ/B-ADJP/O/urgente') using repr. So I looked at the source code for the __repr__ implementation to see how it is formed:

    def __repr__(self):
        return "Sentence(%s)" % repr(" ".join(["/".join(word.tags) for word in self.words]))
    

    So the string "in parentheses" is built by joining each word's tags with "/" and then joining the words with spaces. You can reuse that code on the pattern.text.tree.Text objects you already have, because "a Text is a list of Sentence objects. Each Sentence is a list of Word objects." (from the Parse trees documentation).

    So here's my hacky solution:

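    # Assumption: pattern.en is used here; pattern.es also provides parsetree for Spanish input.
    from pattern.en import parsetree

    # Parse each word into a pattern.text.tree.Text object.
    parsed = [parsetree(data, lemmata=True) for data in ['Urgente', 'Recibimos', 'Info']]

    # A Text is a list of Sentences; each Sentence holds a list of Words.
    output = []
    for text in parsed:
        for sentence in text:
            # Same join that Sentence.__repr__ uses, without the "Sentence(...)" wrapper.
            output.append(" ".join("/".join(word.tags) for word in sentence.words))

    print(output)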
    

    Printing output gives:

    ['Urgente/NNP/B-NP/O/urgente', 'Recibimos/NNP/B-NP/O/recibimos', 'Info/NNP/B-NP/O/info']
    

    Note that this solution results in a list of strs (losing all the properties/methods from the original parsetree output).
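
If you need the strings but also want to keep the original objects around, one option is to return both side by side. The following is only a rough sketch: Word and Sentence here are hypothetical stand-ins (plain namedtuples that model just the .words and .tags attributes used above) so that it runs without Pattern installed, and format_sentence and format_texts are names I made up for illustration, not part of the library.

```python
from collections import namedtuple

# Hypothetical stand-ins for Pattern's Word and Sentence classes.
# Only the attributes used by the join (.tags and .words) are modelled.
Word = namedtuple("Word", ["tags"])
Sentence = namedtuple("Sentence", ["words"])


def format_sentence(sentence):
    """Join each word's tags with '/', then join the words with spaces."""
    return " ".join("/".join(word.tags) for word in sentence.words)


def format_texts(texts):
    """Return (formatted string, original sentence) pairs for every sentence."""
    return [(format_sentence(s), s) for text in texts for s in text]


# Fake data shaped like the parsed output: a list of texts,
# where each text is a list of sentences.
texts = [
    [Sentence(words=[Word(tags=["urgente", "JJ", "B-ADJP", "O", "urgente"])])],
    [Sentence(words=[Word(tags=["info", "NN", "B-NP", "O", "info"])])],
]

pairs = format_texts(texts)
strings = [formatted for formatted, _ in pairs]
print(strings)
```

With real parsetree output you would pass the list of Text objects instead of the fake data; each pair then still gives you access to the Sentence the string came from.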