web-scraping visual-studio-code beautifulsoup google-colaboratory

BeautifulSoup.text returns blank string in VSCode, but works fine in Google Colab


I am trying to scrape this website https://understat.com/league/EPL.

Once I have parsed the page:

import json
from bs4 import BeautifulSoup
from urllib.request import urlopen
scrape_urlEPL="https://understat.com/league/EPL"
page_connect=urlopen(scrape_urlEPL)
page_html=BeautifulSoup(page_connect, "html.parser")

Then I search for "script" in the html.

page_html.findAll(name="script")

This gives me a list of all occurrences of "script". Say I want to extract the text from the element at index 3 (the fourth one). Just printing the HTML for it shows valid output.

page_html.findAll(name="script")[3]

The output:

<script>
    var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221389\x22,\x22player_name\x22\x3A\x22Jorginho\x22,\x22games\x22\x3A\x2228\x22,\x22time\x22\x3A\x222022\x22,\x22goals\x22\x3A\x227\x22,\x22xG\x22\x3A\x226.972690678201616\x22,\x22assists\x22\x3A\x221\x22,\x22xA\x22\x3A\x221.954869382083416\x22,\x22shots\x22\x3A\x2214\x22,\x22key_passes\x22\x3A\x2224\x22,\x22yellow_cards\x22\x3A\x222\x22,\x22red_cards\x22\x3A\x220\x22,\x22position\x22\x3A\x22M\x20S\x22,\x2....

Now if I want to extract the text from this,

page_html.findAll(name="script")[3].text

This gives an empty string ''.

However the same code works fine in Google Colab and returns:

'\n\tvar playersData\t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x22647\\x22,\\x22player_name\\x22\\x3A\\x22Harry\\x20Kane\\x22,\\x22games\\x22\\x3A\\x2235\\x22,\\x22time\\x22\\x3A\\x223097\\x22,\\x22goals\\x22\\x3A\\x2223\\x22,\\x22xG\\x22\\x3A\\x2222.174858909100294\\x22,\\x22assists\\x22\\x3A\\x2214\\x22,\\x22xA\\x22\\x3A\\x227.577093588188291\\x22,\\x22shots\\x22\\x3A\\x22138\\x22,\\x22key_passes\\x22\\x3A\\x2249...' 

which is as expected. I don't understand why this error comes up in VSCode.
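When the same code behaves differently in two environments, a reasonable first check is whether both run the same interpreter and the same Beautiful Soup version. The sketch below is only a diagnostic aid: the helper name `describe_environment` and the one-line sample HTML are made up for illustration, and it assumes (without confirmation from the post) that the environments differ in installed packages.

```python
import sys

import bs4
from bs4 import BeautifulSoup


def describe_environment():
    """Collect details that commonly differ between two Python setups."""
    # Tiny hypothetical document, used only to see how a <script> body is stored.
    soup = BeautifulSoup("<script>var x = 1;</script>", "html.parser")
    child = soup.script.contents[0]
    return {
        "python": sys.executable,          # which interpreter is running
        "bs4_version": bs4.__version__,    # installed Beautiful Soup version
        "script_child_type": type(child).__name__,  # class holding the script body
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this in both VSCode and Colab and comparing the printed values would show whether the two setups actually match.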


Solution

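  • The contents of a <script> tag are held as a string, but Beautiful Soup does not treat them as text.

    JSON.parse is a JavaScript function that parses a string, so what sits inside the tag is JavaScript source rather than human-visible page text. As of Beautiful Soup 4.9.0, when html.parser or lxml is in use, the contents of <script>, <style>, and <template> tags are generally not considered text, so .text can come back empty. The two environments probably have different Beautiful Soup versions installed, which would explain the different results.

    Use .string instead of .text: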

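    import httpx
    import trio
    from bs4 import BeautifulSoup  # the 'lxml' parser below needs the lxml package


    async def main():
        # timeout=None disables httpx's default request timeout
        async with httpx.AsyncClient(timeout=None) as client:
            r = await client.get('https://understat.com/league/EPL')
            soup = BeautifulSoup(r.text, 'lxml')
            # .string returns the single string held directly by the tag
            goal = soup.select('script')[3].string
            print(goal)


    if __name__ == "__main__":
        trio.run(main)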

    Ref : Bs4 difference between string and text
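Once `.string` returns the script source, the JSON.parse argument still has to be pulled out and decoded before `json` can read it. The following is a minimal offline sketch of that step, not a definitive implementation. `SAMPLE_HTML`, `decode_js_hex_escapes`, and `extract_players` are hypothetical names, the sample is a shortened stand-in for the real page, and the code assumes the payload uses only `\xNN` escapes and contains no single quote.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page's inline script (raw string,
# so the \xNN sequences stay as literal backslash escapes, as in the page).
SAMPLE_HTML = r"""
<html><body>
<script>
    var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221389\x22,\x22player_name\x22\x3A\x22Jorginho\x22\x7D\x5D');
</script>
</body></html>
"""


def decode_js_hex_escapes(text):
    """Replace each literal \\xNN sequence with the character U+00NN."""
    return re.sub(r"\\x([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), text)


def extract_players(html):
    """Return the decoded playersData list, or None if it is not found."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        source = script.string  # .string, not .text, as explained above
        if not source or "playersData" not in source:
            continue
        match = re.search(r"JSON\.parse\('(.*?)'\)", source)
        if match:
            return json.loads(decode_js_hex_escapes(match.group(1)))
    return None


players = extract_players(SAMPLE_HTML)
print(players)
```

Searching for the variable name instead of relying on index 3 keeps the lookup working if the page reorders its script tags.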