python-3.x beautifulsoup html-parsing

How to iterate through HTML and parse specific data?


The Python code below extracts specific data from HTML, but it only works when the HTML contains a single instance.

What I need is code that iterates through HTML containing several instances and retrieves the specific information from each one. How can I achieve that?

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
<title>Exported Data</title>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="css/style.css" rel="stylesheet"/>
  <script src="js/script.js" type="text/javascript">
  </script>
 </head>
 <body onload="CheckLocation();">
  <div class="page_wrap">
   <div class="page_header">
    <div class="content">
     <div class="text bold">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
     </div>
    </div>
   </div>
   <div class="page_body chat_page">
    <div class="history">
     <div class="message service" id="message-1">
      <div class="body details">
9 March 2023
      </div>
     </div>
     <div class="message default clearfix" id="message3984">
      <div class="pull_left userpic_wrap">
       <div class="userpic userpic2" style="width: 42px; height: 42px">
        <div class="initials" style="line-height: 42px">
?
        </div>
       </div>
      </div>
      <div class="body">
       <div class="pull_right date details" title="09.03.2023 00:27:10 UTC-03:00">
00:27
       </div>
       <div class="from_name">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
       </div>
       <div class="text">
Easy Bot - Over 2.5<br><br>🏆 Liga: Premiership<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 03:30  03:33  03:36 ( 03:39)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
       </div>
      </div>
     </div>
     <div class="message default clearfix" id="message3985">
      <div class="pull_left userpic_wrap">
       <div class="userpic userpic2" style="width: 42px; height: 42px">
        <div class="initials" style="line-height: 42px">
?
        </div>
       </div>
      </div>
      <div class="body">
       <div class="pull_right date details" title="09.03.2023 00:45:16 UTC-03:00">
00:45
       </div>
       <div class="from_name">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
       </div>
       <div class="text">
Easy Bot - Over 2.5<br><br>🏆 Liga: Premiership<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 03:48  03:51  03:54 ( 03:57)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
       </div>
      </div>
     </div>
     </div>
    </div>
   </div>
  </div>
 </body>
</html>

Solution

  • This one is somewhat more complex than your previous question, so it takes a few more acrobatics:

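    from bs4 import BeautifulSoup

    # Assumes `html` holds the exported HTML shown in the question
    soup = BeautifulSoup(html, "html.parser")

    # div[class="body"] matches only an exact class of "body", so "body details" is skipped
    for b in soup.select('div[class="body"]'):
        # The title attribute looks like "09.03.2023 00:27:10 UTC-03:00"
        d_str = b.select_one('div.date.details')['title']
        calendar = d_str.split(" ")
        print("Date: ",calendar[0])
        print("Time: ",calendar[1])
        targets = b.select('div.text')
        for target in targets:
            # Each <br>-separated line comes back as one stripped string
            for sts in target.stripped_strings:
                if "⚽ Jogos: " in sts:
                    # Tokenize the text after the label, turning "( 03:39)" into "(03:39)"
                    jugos = [elem for elem in sts.split('⚽ Jogos: ')[1].replace('( ',"(").split(" ") if elem]
                    if "✅" in jugos:
                        # 1-based position of the check mark among the tokens
                        ind = jugos.index('✅')+1
                        print("Checkmarked: ", ind)
                        jugos.remove("✅")
                        print(jugos)
                    else:
                        print(jugos)
                        print("Checkmarked: NA")
            print('------------------------------------')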
    

    Output:

    Date:  09.03.2023
    Time:  00:27:10
    Checkmarked:  1
    ['03:30', '03:33', '03:36', '(03:39)']
    ------------------------------------
    Date:  09.03.2023
    Time:  00:45:16
    Checkmarked:  1
    ['03:48', '03:51', '03:54', '(03:57)']
    ------------------------------------
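
If you would rather collect the values than print them, the two string-handling steps in the loop can be pulled out into small helpers. This is only a sketch using the standard library: `parse_title` and `parse_jogos` are hypothetical names, not part of BeautifulSoup, and the date format is assumed from the two `title` attributes in the question.

```python
from datetime import datetime


def parse_title(title):
    """Turn a title like '09.03.2023 00:27:10 UTC-03:00' into an aware datetime.

    Assumes the format seen in the question; %z accepts offsets
    written with a colon (such as -03:00) on Python 3.7 and later.
    """
    return datetime.strptime(title, "%d.%m.%Y %H:%M:%S UTC%z")


def parse_jogos(line):
    """Split a '⚽ Jogos: ...' line into (times, checkmarked).

    checkmarked is the 1-based position of the first ✅ among the
    tokens, or None when the line has no check mark.
    """
    payload = line.split("Jogos:", 1)[1]
    # Join "( 03:39)" into "(03:39)", then split on whitespace
    tokens = payload.replace("( ", "(").split()
    checkmarked = None
    if "✅" in tokens:
        checkmarked = tokens.index("✅") + 1
        tokens.remove("✅")
    return tokens, checkmarked


when = parse_title("09.03.2023 00:27:10 UTC-03:00")
times, checkmarked = parse_jogos("⚽ Jogos: ✅ 03:30  03:33  03:36 ( 03:39)")
record = {"when": when, "times": times, "checkmarked": checkmarked}
```

Inside the loop you would pass `d_str` to `parse_title` and `sts` to `parse_jogos`, then append each `record` to a list instead of printing.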