python beautifulsoup screen-scraping

Scraping nested XML using Beautiful Soup


xml = """<f transform="translate(7,7)" class="SoccerPlayer SoccerPlayer-11 Team-Away  Outcome-Complete" data-id="8">
    <rect x="-15" y="-15" width="30" height="30" transform="rotate(0)" class="SoccerShape"></rect>
    <text x="0" y="7" text-anchor="middle" transform="translate(0,0)rotate(0)">11</text>
    <text class="Soccer-Hidden">
        <div>
            <h3>
                <span class="Soccer-Key">
            Suc passes
          </span>
                <span class="Soccer-Value">
            82
          </span>
            </h3>
            <p>
          Ronaldo
        </p>
        </div>
    </text>
</f>"""

I'm currently trying to scrape the XML above using Beautiful Soup. Specifically:

from bs4 import BeautifulSoup as bs
soup = bs(xml, "xml")
for pr in soup.find_all("f"):
    try:
        player = pr['class']
        time = pr['data-id']
    except KeyError:
        continue  # skip <f> tags that lack either attribute
    print(player, time)

This is working as intended.

However, I am having difficulty scraping the information nested inside the <text class="Soccer-Hidden"> tag. I'm trying to get the text of <span class="Soccer-Key"> and <span class="Soccer-Value">, and also the text between the <p> tags (Ronaldo).

What can I add to my code to get these? Thanks


Solution

  • Try the findChildren method (an older alias of find_all), passing the class as a dictionary of attributes:

    for pr in soup.find_all("f"):
        soc_key = pr.findChildren("span" , { "class" : "Soccer-Key" })[0].text
        soc_value = pr.findChildren("span" , { "class" : "Soccer-Value" })[0].text
        name = pr.findChildren("p")[0].text
        print(soc_key, soc_value, name)
    

    This prints Suc passes 82 Ronaldo, with some extra whitespace around each value that you can remove with strip().
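As a variation on the same idea, here is a minimal sketch that uses find with the class_ keyword and get_text(strip=True), so the surrounding whitespace is dropped at extraction time. It makes two assumptions that are not in the original posts: the markup is a trimmed copy of the question's XML, and the built-in "html.parser" is used so the snippet runs without lxml installed. The same calls should also work with the "xml" parser from the question. The helper name text_of is made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: only the
# parts relevant to the nested lookup are kept).
xml = """<f class="SoccerPlayer SoccerPlayer-11 Team-Away" data-id="8">
    <text class="Soccer-Hidden">
        <div>
            <h3>
                <span class="Soccer-Key">
            Suc passes
          </span>
                <span class="Soccer-Value">
            82
          </span>
            </h3>
            <p>
          Ronaldo
        </p>
        </div>
    </text>
</f>"""


def text_of(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


# "html.parser" ships with Python; the question uses "xml", which needs lxml.
soup = BeautifulSoup(xml, "html.parser")

rows = []
for pr in soup.find_all("f"):
    # find() returns the first matching descendant, or None if there is none.
    key = text_of(pr.find("span", class_="Soccer-Key"))
    value = text_of(pr.find("span", class_="Soccer-Value"))
    name = text_of(pr.find("p"))
    if None in (key, value, name):
        continue  # skip <f> elements that lack any of the three pieces
    rows.append((key, value, name))

for row in rows:
    print(*row)
```

Compared with indexing [0] on a result list, checking for None avoids an IndexError when an <f> element has no hidden text block.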