python web-scraping request xmlhttprequest

How to get/identify the specific data from the website?


I am trying to scrape a website with the code below:

import requests
import pandas as pd

with requests.Session() as connection:
    connection.headers.update(
        {
            "referer": "https://gmatclub.com/forum/decision-tracker.html",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
        }
    )
    _ = connection.get("https://gmatclub.com/forum/decision-tracker.html")
    endpoint = connection.get("https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all").json()
    for item in endpoint["statistics"]:
        print(item)

I am not sure how to get the admission status shown under "Real-time updates" in the decision tracker.



Solution

  • The ticks, crosses and circles show whether an application was admitted, denied, or is still pending for some other reason. This information is in the status_id field of each item. The page's source code contains a mapping from these numbers to statuses. Converting that mapping to a Python dict lets us look up each status and also reconstruct the ticks, crosses and circles:

    import requests
    
    status_mapping = {1: { 'id':1,'class':'mainApplicationSubmitted','name':'Application Submitted' },
        3: { 'id':3,'class':'mainInterviewed','name':'interviewed' },
        4: { 'id':4,'class':'mainAdmitted','name':'admited' },
        5: { 'id':5,'class':'mainDenied','name':'denied' },
        6: { 'id':6,'class':'mainDenied','name':'denied' },
        7: { 'id':7,'class':'mainWaitListed','name':'waitlisted' },
        8: { 'id':8,'class':'mainWaitListed','name':'waitlisted' },
        9: { 'id':9,'class':'mainMatriculating','name':'matriculating' },
        10:{ 'id':10,'class':'mainWlAdmited','name':'admitted From WL' },
        11:{ 'id':11,'class':'mainResearching','name':'researching Or Writing Essays' },
        12:{ 'id':12,'class':'mainInvitedToInterview','name':'invited To Interview' },
        13:{ 'id':13,'class':'mainWithdrawn','name':'withdrawn Application '}}
    
    with requests.Session() as connection:
        connection.headers.update(
            {
                "referer": "https://gmatclub.com/forum/decision-tracker.html",
                "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
            }
        )
        _ = connection.get("https://gmatclub.com/forum/decision-tracker.html")
        endpoint = connection.get("https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all").json()
        for item in endpoint["statistics"]:
            status_id = item.get('status_id')
            try:
                status = status_mapping[int(status_id)]['name']
            except (KeyError, TypeError, ValueError):
                # Unknown or malformed status_id: report it and move on to the next item
                print(f"Key {status_id} is missing from status_mapping. Check the entry at {item.get('date')} to see what this key represents and add it to status_mapping.")
                continue
            # Colour of the icon: green for admitted, red for denied, grey for everything else
            if int(status_id) in [4]:
                status_short = 'green'
            elif int(status_id) in [5, 6]:
                status_short = 'red'
            else:
                status_short = 'grey'
            print(status, status_short)
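
If you want to check the lookup logic without hitting the site, the same idea can be sketched as a small helper. This is only an illustration under a few assumptions: `describe_status`, `STATUS_NAMES`, `ADMITTED_IDS`, `DENIED_IDS` and `sample_items` are names I made up, the sample records are invented (real items come from endpoint["statistics"]), and I am assuming status_id may arrive either as a string or as an int. The id-to-name table is a compact copy of the mapping above, with the spelling of "admitted" normalised.

```python
# Compact id -> name table, copied from the status_mapping above
# (spelling of "admitted" normalised; ids 2 is absent there as well).
STATUS_NAMES = {
    1: "Application Submitted",
    3: "interviewed",
    4: "admitted",
    5: "denied",
    6: "denied",
    7: "waitlisted",
    8: "waitlisted",
    9: "matriculating",
    10: "admitted From WL",
    11: "researching Or Writing Essays",
    12: "invited To Interview",
    13: "withdrawn Application",
}

ADMITTED_IDS = {4}   # shown as a green tick
DENIED_IDS = {5, 6}  # shown as a red cross


def describe_status(status_id):
    """Return (name, colour) for a status_id, or (None, None) if it is unknown."""
    try:
        key = int(status_id)
    except (TypeError, ValueError):
        # status_id was None or not a number
        return None, None
    name = STATUS_NAMES.get(key)
    if name is None:
        return None, None
    if key in ADMITTED_IDS:
        colour = "green"
    elif key in DENIED_IDS:
        colour = "red"
    else:
        colour = "grey"
    return name, colour


# Made-up records for illustration only; they are not real tracker data.
sample_items = [
    {"status_id": "4", "date": "2021-04-01"},
    {"status_id": 5, "date": "2021-04-02"},
    {"status_id": "12", "date": "2021-04-03"},
    {"status_id": "2", "date": "2021-04-04"},  # id not in the mapping
]

for item in sample_items:
    print(item["date"], describe_status(item["status_id"]))
```

Returning (None, None) for an unknown id, instead of raising, lets the calling loop decide whether to skip the entry or log it for later inspection.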