Exclude a tag (<pattern>) inside a tag (<topic>) on a result set using BeautifulSoup


I'm new to web scraping with Python and am currently using BeautifulSoup for data extraction. I have an .aiml file (XML) from which I want to extract the text of every pattern tag that is NOT inside a topic tag.

I can already get all the pattern values, but the challenge is that any pattern nested inside a topic tag should be excluded from the result set.

Here is the aiml file:

<?xml version = "1.0" encoding = "UTF-8"?>

<aiml version="1.0.1" encoding="UTF-8">
  <topic name="botdog">
   <category>
      <pattern>MY DOG'S NAME IS *</pattern>
      <template>
         That is interesting that you have a dog named <set name="dog"><star/></set>
      </template>  
   </category>

   <category>
      <pattern>WHAT IS MY DOG'S NAME</pattern>
      <template>
         Your dog's name is <get name="dog"/>.
      </template>  
   </category>  
  </topic>

  <topic name="botcat">
   <category>
      <pattern>MY CAT'S NAME IS *</pattern>
      <template>
         That is interesting that you have a cat named <set name="cat"><star/></set>
      </template>  
   </category>

   <category>
      <pattern>WHAT IS MY CAT'S NAME</pattern>
      <template>
         Your cat's name is <get name="cat"/>.
      </template>  
   </category>  
  </topic>


  <category>
      <pattern>HELLO ALICE</pattern>
      <template>
         Hello User
      </template>
   </category>

   <category>
      <pattern>HOW ARE YOU</pattern>
      <template>
         I'm fine
      </template>
   </category>
</aiml>

Python code (Flask):

@extract.route('/')
def index_page():
    folder = 'templates/topic.aiml'
    with open(folder, 'r') as myfile:
        soup = BeautifulSoup(myfile.read(), 'html.parser')
    data_topic = [match.pattern.text for match in soup.find_all('category')]

    print(data_topic)


    # data = " ".join(data_set)

    return jsonify({'data_set': data_topic})

The output of print() is:

["MY DOG'S NAME IS *", "WHAT IS MY DOG'S NAME", "MY CAT'S NAME IS *", "WHAT IS MY CAT'S NAME", 'HELLO ALICE', 'HOW ARE YOU']

It should contain only the patterns that are not inside a topic tag: ['HELLO ALICE', 'HOW ARE YOU']


Solution

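  • Iterate over the category tags and skip any whose direct parent is a topic tag, so that only the top-level patterns are collected: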

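    from bs4 import BeautifulSoup
    from flask import jsonify

    # `extract` is the Flask app or blueprint object from the question,
    # assumed to be defined elsewhere.
    @extract.route('/')
    def index_page():
        folder = 'templates/topic.aiml'

        with open(folder, 'r') as myfile:
            soup = BeautifulSoup(myfile.read(), 'html.parser')

        data = []
        for cat in soup.find_all('category'):
            # Skip categories that sit directly inside a <topic>.
            if cat.parent.name == 'topic':
                continue
            data.append(cat.find('pattern').text)

        print(data)
        return jsonify({'data_set': data})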
    

    Hope this helps! Check out the docs for more examples.
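
    One caveat with the loop above: it checks only the direct parent of each category. If a file ever wraps a category in another element inside a topic, that check would miss it. Below is a minimal sketch of a variant that tests every ancestor instead, using BeautifulSoup's find_parent(). The AIML string is a trimmed-down stand-in for the file in the question, and the function name patterns_outside_topics is made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the AIML file in the question.
AIML = """
<aiml version="1.0.1" encoding="UTF-8">
  <topic name="botdog">
    <category>
      <pattern>WHAT IS MY DOG'S NAME</pattern>
      <template>Your dog's name is <get name="dog"/>.</template>
    </category>
  </topic>
  <category>
    <pattern>HELLO ALICE</pattern>
    <template>Hello User</template>
  </category>
  <category>
    <pattern>HOW ARE YOU</pattern>
    <template>I'm fine</template>
  </category>
</aiml>
"""


def patterns_outside_topics(markup):
    """Return the text of every <pattern> that has no <topic> ancestor.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    return [
        pattern.get_text(strip=True)
        for pattern in soup.find_all("pattern")
        # find_parent() walks up the tree and returns None
        # when no enclosing <topic> exists.
        if pattern.find_parent("topic") is None
    ]


print(patterns_outside_topics(AIML))
# → ['HELLO ALICE', 'HOW ARE YOU']
```

    In the Flask view, the list returned by this helper could be passed to jsonify() in place of data.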