Tags: python, html, beautifulsoup, html-parsing, kaggle

Python: Getting text from HTML using BeautifulSoup


I am trying to extract the ranking text from a Kaggle user profile page (example: the profile of the user ranked no. 1).

I am using the following code:

import requests
from bs4 import BeautifulSoup


def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(plainText, "html.parser")
    for item_name in soup.findAll('h4',{'data-bind':"text: rankingText"}):
        print(item_name.string)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)

The result is None. The problem is that soup.findAll('h4',{'data-bind':"text: rankingText"}) outputs:

[<h4 data-bind="text: rankingText"></h4>]

but when I inspect the page's HTML in the browser, the element looks like this:

<h4 data-bind="text: rankingText">1st</h4>

It's clear that the text is missing. How can I get around that?

Edit: Printing the soup variable in the terminal, I can see that this value does exist somewhere in the output.

So there should be a way to access it through soup.

Edit 2: I tried, unsuccessfully, to use the most-voted answer from a related Stack Overflow question. The solution may lie somewhere along those lines.


Solution

  • The <h4> is empty in the downloaded HTML because its text is filled in by JavaScript in the browser (that is what the data-bind attribute is for), and requests does not execute scripts. If you aren't going to use browser automation through Selenium, as @Ali suggested, you have to parse the JavaScript that contains the desired information. You can do this in different ways. Here is working code that locates the script with a regular expression pattern, extracts the profile object, loads it with json into a Python dictionary, and prints the desired ranking:

    import re
    import json
    
    from bs4 import BeautifulSoup
    import requests
    
    
    response = requests.get("https://www.kaggle.com/titericz")
    soup = BeautifulSoup(response.content, "html.parser")
    
    pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
    script = soup.find("script", text=pattern)
    
    profile_text = pattern.search(script.text).group(1)
    profile = json.loads(profile_text)
    
    print(profile["ranking"], profile["rankingText"])
    

    Prints:

    1 1st
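
The same steps can be tried offline on a small inline HTML string, which makes it easier to see what each one does. This is only a sketch: SAMPLE_HTML is a made-up stand-in for the real page source (the actual Kaggle markup may differ or may have changed), extract_profile is a hypothetical helper name, and the non-greedy pattern assumes the profile object is flat, with no nested braces.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: the <h4> is empty,
# while the data lives in an inline script as a JSON-like object.
SAMPLE_HTML = """
<html><body>
<h4 data-bind="text: rankingText"></h4>
<script>
var page = {
    profile: {"ranking": 1, "rankingText": "1st"},
    other: {"unused": true}
};
</script>
</body></html>
"""

# Non-greedy match up to the first "}," (assumes a flat profile object).
PROFILE_PATTERN = re.compile(r"profile:\s*(\{.*?\}),", re.DOTALL)


def extract_profile(html):
    """Return the embedded profile as a dict, or None if it is not found."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the <script> whose text matches the pattern.
    script = soup.find("script", string=PROFILE_PATTERN)
    if script is None:
        return None
    match = PROFILE_PATTERN.search(script.string)
    # The captured group is valid JSON here, so json.loads can parse it.
    return json.loads(match.group(1))


profile = extract_profile(SAMPLE_HTML)
print(profile["ranking"], profile["rankingText"])
```

Returning None when no matching script exists avoids an AttributeError if the page layout changes; the caller can then decide whether to fall back to browser automation.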