Search code examples
pythonweb-scrapingbeautifulsouppython-requests-html

How to scrape a table but 'not a table' from a page, using python?


Humble greetings and welcome to anyone willing to spend time here. I shall introduce myself as a very green student of data science and also python. This thread is meant to gain insight from rather more fortunate minds capable of deeper understanding within the realm of python.

the table im trying to scrape

As we can see, the value for each row itself could be found easily on the page inspection. But it seems that they all are using the same class name. As for now, i'm afraid i couldnt even find the right keyword to search for any working method in google.

These are the codes that i've tried. They dont work and embaressing, but i have to show it anyway. Ive tried fiddling by adding .content, .text, find, find_all, but i understand that my failure lies at even deeper fundamental core.

from bs4 import BeautifulSoup
import requests
from csv import writer
import pandas as pd

url= 'https://m4.mobilelegends.com/stats'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
lists = soup.find('div', class_="m4-team-stats-scroll")

with open('m4stats_team.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Team', 'Win Rate', 'Average KDA', 'Average Kills', 'average Deaths', 'Average Assists', 'Average Game Time', 'Average Lord Kills', 'Average Tortoise Kills', 'Average Towers Destroy', 'First Blood Rate', 'Hero Pool']
    thewriter.writerow(header)

    for list in lists:
        team = list.find_all('p', class_="h3 pl-5 whitespace-nowrap hidden xl:block")
        awr = list.find_all('p', class_="h4")
        akda = list.find('p', class_="h4").text
        akill = list.find('p', class_="h4").text
        adeath = list.find('p', class_="h4").text
        aassist = list.find('p', class_="h4").text
        atime = list.find('p', class_="h4").text
        aalord = list.find('p', class_="h4").text
        atortoise = list.find('p', class_="h4").text
        atower = list.find('p', class_="h4").text
        firstblood = list.find('p', class_="h4").text
        hrpool = list.find('p', class_="h4").text


        info = [team, awr, akda, akill, adeath, aassist, atime, aalord, atortoise, atower, firstblood, hrpool]
        thewriter.writerow(info)

pd.read_csv('m4stats_team.csv').head()

What am i expecting: Any kind of insight. Whether if it's clue, keyword, code snippet, i do appreciate and mostfully thankful for any kind of guidance. I am not asking for somehow getting the complete scrapped CSV, as i couldve done it manually. At these point i want to be able to do basic webscraping myself.


Solution

  • You can iterate over rows in the table and its items.

    from bs4 import BeautifulSoup
    import requests
    
    page = requests.get('https://m4.mobilelegends.com/stats')
    page.raise_for_status()
    
    page = BeautifulSoup(page.content)
    
    table = page.find("div", class_="m4-team-stats-scroll")
    
    with open("table.csv", "w") as file:
        for row in table.find_all("div", class_="m4-team-stats"):
            items = row.find_all("div", class_="col-span-1")
            # write into file in csv format, use map to extract text from items
            file.write(",".join(map(lambda item: item.text, items)) + "\n")
    

    Display output:

    import pandas as pd
    
    df = pd.read_csv("table.csv")
    
    print(df)
    
    # Outputs:
    """
          Team ↓Win Rate  ...  ↓First Blood Rate  ↓Hero pool
    0     echo     72.0%  ...              48.0%          37
    1      rrq     60.9%  ...              60.9%          37
    2       tv     60.0%  ...              60.0%          29
    3     fcon     55.0%  ...              85.0%          32
    4      inc     53.3%  ...              26.7%          31
    5     onic     52.9%  ...              47.1%          39
    6     blck     52.2%  ...              47.8%          31
    7   rrq-br     46.2%  ...              30.8%          32
    8      thq     45.5%  ...              63.6%          27
    9      s11     42.9%  ...              28.6%          26
    10     tdk     37.5%  ...              62.5%          24
    11      ot     28.6%  ...              28.6%          21
    12     mvg     20.0%  ...              20.0%          15
    13  rsg-sg     20.0%  ...              60.0%          17
    14    burn      0.0%  ...              20.0%          21
    15     mdh      0.0%  ...              40.0%          18
    
    [16 rows x 12 columns]
    """