python, web-scraping, xmlhttprequest

Web-scraping from pages with the same link


I am trying to scrape some information from this website: https://www.nordnet.se/marknaden/aktiekurser?sortField=name&sortOrder=asc&exchangeCountry=SE&exchangeList=se%3Alargecapstockholmsek.

What I want to do is grab the sector information for each company, which is shown under the "Om bolaget" tab on the company-specific pages. More specifically, the information I want is in the "Sektor" and "Branch" fields. The links to the company-specific pages can easily be obtained with requests and BeautifulSoup in Python.

When making a GET request to these links, the response sometimes contains the wanted information in the form "sector: ..." and "sector_group: ...", but not always. One example where it works is Latour https://www.nordnet.se/marknaden/aktiekurser/16099736-latour-investmentab-b, and one example where it doesn't work is EQT https://www.nordnet.se/marknaden/aktiekurser/17117956-eqt.

I can see that an XHR (a POST request) is made when clicking "Om bolaget", but I am not sure how to make use of it.

The code I use to grab the sector information from a company-specific page is provided below:

import json  # needed for json.loads below
import re

import requests
from bs4 import BeautifulSoup

def get_sector(url):

    sector, sector_group = None, None
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    tags = soup.find_all('script')  # find_all is the current name for findAll
    for tag in tags:
        content = tag.get_text()
        content = content.replace('\\', '')
        if '__initialState__' not in content:
            continue
        try:
            sector = re.findall(r'"sector":"\w+"', content)[0]
            sector = json.loads('{' + sector + '}')
            sector = sector['sector']
        except IndexError:
            print(url)
            print('Sector not found')

        try:
            sector_group = re.findall(r'"sector_group":"\w+"', content)[0]
            sector_group = json.loads('{' + sector_group + '}')
            sector_group = sector_group['sector_group']
        except IndexError:
            print('Sector Group not found')

        break

    return sector, sector_group

Any input would be much appreciated.


Solution

  • To get the Om bolaget batch, you first need the ntag value from the response headers of https://www.nordnet.se/api/2/login/anonymous. You can fetch it once and reuse it in later requests; the best way to do that is with requests.session(). In data, the IDs 17117956 and 16099736 should be variables:

    import requests
    
    headers = {
        'Connection': 'keep-alive',
        'Content-Length': '0',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'Origin': 'https://www.nordnet.se',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
        'ntag': 'NO_NTAG_RECEIVED_YET',
        'content-type': 'application/x-www-form-urlencoded',
        'accept': 'application/json',
        'client-id': 'NEXT',
        'DNT': '1',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Referer': 'https://www.nordnet.se/se',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
    }
    
    with requests.session() as s:
        r = s.post('https://www.nordnet.se/api/2/login/anonymous', headers=headers)
    
        headers['ntag'] = r.headers['ntag']
        headers['content-type'] = 'application/json'
        headers['accept'] = 'application/json'
    
        for company_id in ['17117956', '16099736']:
            data = '{"batch":"[{\\"relative_url\\":\\"company_data/keyfigures/' + company_id + '\\",\\"method\\":\\"GET\\"},{\\"relative_url\\":\\"company_data/yearlyfinancial/' + company_id + '\\",\\"method\\":\\"GET\\"},{\\"relative_url\\":\\"company_data/summary/' + company_id + '\\",\\"method\\":\\"GET\\"}]"}'
            r = s.post('https://www.nordnet.se/api/2/batch', headers=headers, data=data)
            print(r.text)
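
The hand-escaped data string above is easy to get wrong, so here is a minimal sketch of one way to build it with json.dumps and to pull the numeric ID out of each company page URL. The helper names company_id_from_url and build_batch_payload are hypothetical (they are not part of the site or of requests), and the sketch only assumes what the snippet above already shows: the ID is the leading number in the URL path, and "batch" holds a JSON-encoded string rather than a nested list.

```python
import json
import re


def company_id_from_url(url):
    """Return the numeric ID at the start of a company page slug, or None.

    Hypothetical helper: assumes URLs of the form
    .../marknaden/aktiekurser/<id>-<name>, as in the two examples above.
    """
    match = re.search(r'/aktiekurser/(\d+)', url)
    return match.group(1) if match else None


def build_batch_payload(company_id):
    """Build the body for the batch POST without manual escaping.

    Hypothetical helper: reproduces the string that the answer
    concatenates by hand for a given company ID.
    """
    endpoints = ('keyfigures', 'yearlyfinancial', 'summary')
    batch = [
        {'relative_url': f'company_data/{name}/{company_id}', 'method': 'GET'}
        for name in endpoints
    ]
    compact = (',', ':')  # no spaces, matching the hand-written string
    # "batch" is itself a JSON-encoded string, so it is serialized twice.
    inner = json.dumps(batch, separators=compact)
    return json.dumps({'batch': inner}, separators=compact)


urls = [
    'https://www.nordnet.se/marknaden/aktiekurser/17117956-eqt',
    'https://www.nordnet.se/marknaden/aktiekurser/16099736-latour-investmentab-b',
]
payloads = {}
for url in urls:
    company_id = company_id_from_url(url)
    payloads[company_id] = build_batch_payload(company_id)
```

Each value in payloads can then be passed as data= to the s.post(...) call against the batch URL inside the session loop. How the sector fields are laid out in the batch response is not shown in the answer, so inspect r.text (or r.json()) before writing any parsing code.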