
Ignoring data using the ttp module in Python


I will explain the problem using the following sample. I can parse the data below with the template that follows it. Using {{ignore}} lets a line match the template while discarding the parts of the line that I don't want to keep.

from ttp import ttp
import json

data_to_parse = """
1.peace in the world
2.peace in the world world 
3.peace in the world world world 
"""

To parse this data I can use the following template.

ttp_template = """
<group name="Quote">
{{peace}} in the {{world}}
</group>
<group name="Quote">
{{peace}} in the {{world}} {{ignore}}
</group>
<group name="Quote">
{{peace}} in the {{world}} {{ignore}} {{ignore}}
</group>
"""

With the following code, I get the parsed data I want:

def parse_quotes(data):
    # Named parse_quotes so the local parser object does not shadow the function.
    parser = ttp(data=data, template=ttp_template)
    parser.parse()

    # Take the first JSON string from the returned list of results.
    results = parser.result(format='json')[0]

    # Convert the JSON string into Python objects.
    result = json.loads(results)

    print(result)

parse_quotes(data_to_parse)

See the output I have:

[Image: screenshot of the parsed output]

The problem is that I cannot know in advance how many times "world" appears at the end of each line, and I don't want to keep adding {{ignore}} variables just to match the line and discard the words I don't need. For example, if I add the following line to my data, the template above will not catch it; I would need to add one more {{ignore}} to capture it.

4.peace in the world world world world

As far as I understand, the reason is that ttp splits words at each space. For example, if the words were joined by _ instead of a space, as in 3.peace in the world_world_world, I could capture the data with a single line in my template. However, my data contains lines with spaces, and I need to capture those lines as well.
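The behaviour described above can be reproduced with plain regular expressions. This is only a sketch: it assumes an unadorned ttp match variable behaves like the regex \S+ (a run of non-whitespace characters), and the helper name matches_whole_line is made up for illustration.

```python
import re

# Assumption: a plain {{ variable }} in ttp behaves like the regex \S+,
# i.e. it captures one run of non-whitespace characters.
WORD = r"\S+"


def matches_whole_line(line: str) -> bool:
    """Return True if the line fits '<word> in the <word>' exactly."""
    pattern = rf"(?P<peace>{WORD}) in the (?P<world>{WORD})"
    return re.fullmatch(pattern, line) is not None


print(matches_whole_line("1.peace in the world"))              # single trailing word
print(matches_whole_line("3.peace in the world world world"))  # spaces break the match
print(matches_whole_line("3.peace in the world_world_world"))  # underscores keep it one word
```

Because \S+ cannot cross a space, every extra space-separated word needs its own variable, which is why the number of {{ignore}} entries grows with the line.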

So the question is: is there a way to simplify this? As you can see, I have a workaround, but I need a simpler way to resolve the issue. Any advice is highly appreciated.


Solution

  • I found a way to resolve this: {{ name | PHRASE }} or {{ name | ORPHRASE }} can be used for this purpose.

    {{ name | PHRASE }}

    This pattern matches any phrase, that is, a collection of words separated by a single space character, such as "word1 word2 word3".

    {{ name | ORPHRASE }}

    In many cases the data that needs to be extracted can be either a single word or a phrase. The most prominent examples are descriptions, such as interface descriptions, BGP peer descriptions, etc. ORPHRASE allows matching and extracting such data.