python web-scraping iframe mime

Unable to scrape table content with MIME format data:application/octet-stream using Python


I am trying to scrape some data from a website, but the data is contained in an iframe. I extracted the iframe's source link, but I am not able to scrape the data from that page either. I need help extracting the data from this source link: https://chartviewer-europublic.bigapis.net/nzgaV/index.html

I am also sharing a screenshot that shows the URL of the data download button inside an "a" tag, but I am not able to extract this link either.

[Screenshot: the download button's "a" tag in the page source]

Here is the code I used, with BeautifulSoup for the scraping.

# Libraries

from bs4 import BeautifulSoup
import requests

# Original website link
url_spain_total="https://anfac.com/cifras-clave/matriculaciones-turismos-y-todoterreno/"

page_total=requests.get(url_spain_total).text

soup_spain_total=BeautifulSoup(page_total,"lxml")

print(soup_spain_total.prettify())

# Getting the list of links in the iframe
result_spain=soup_spain_total.find_all("iframe")
result_spain

# Getting the required source link
total_main_link=result_spain[1]["src"]
total_main_link

After getting the source link, I am not able to extract the table contents.

Any help is appreciated. Thanks in Advance!


Solution

  • The following is an example of how you can get that data using Selenium:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    from io import StringIO
    
    chrome_options = Options()
    chrome_options.add_argument("--no-sandbox")
    # chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-notifications")
    chrome_options.add_argument("--window-size=1920,1080")
    
    webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
    browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
    wait = WebDriverWait(browser, 20)
    url = 'https://chartviewer-europublic.bigapis.net/nzgaV/index.html'
    browser.get(url)
    # Wait until the JavaScript-rendered table is available
    table = wait.until(EC.element_to_be_clickable((By.ID, "datatable")))
    # Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
    df = pd.read_html(StringIO(table.get_attribute("outerHTML")))[0]
    print(df)
    browser.quit()
    

    This retrieves the table as a DataFrame and prints it in the terminal:

                Categoría   Ago-22   Ago-21  % Variacion  Acumulado 2022  Acumulado 2021  % Variacion Acumulado
    0            Gasolina  22.3402  20.0702     11311.31         231.348          279.89              -17-17.34
    1              Diesel   8.9639  8.06481     11211.15         92.9799         119.641              -22-22.29
    2               Resto  20.6042  19.4492       595.94         208.715         188.782                1110.56
    3  Total combustibles  51.9075  47.5835       919.09         533.043         588.314                -9-9.39
    4          Particular  24.9512  26.0833    -4,3-4.34         233.413         236.728                 -1-1.4
    5             Empresa  21.7122  17.6732     22922.85         224.337         215.654                  44.03
    6            Alquiler  5.24452  3.82738     37037.03         75.2928         135.931              -45-44.61
    7       Total canales  51.9075  47.5835       919.09         533.043         588.314                -9-9.39
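
If pandas is not available, or you want each cell as raw text with no numeric parsing applied, the same outerHTML string can be split into rows with the standard library. This is only a minimal sketch: `TableRows` is a hypothetical helper, and the sample HTML is invented to stand in for `table.get_attribute("outerHTML")`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <span>) also lands here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Invented sample, standing in for table.get_attribute("outerHTML").
sample_html = (
    '<table id="datatable">'
    "<tr><th>Categoría</th><th>Ago-22</th></tr>"
    "<tr><td>Gasolina</td><td><span>22.340</span></td></tr>"
    "</table>"
)
parser = TableRows()
parser.feed(sample_html)
print(parser.rows)
```

Keeping the cells as strings leaves the choice of decimal and thousands separators to you when converting them to numbers later.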

    The Selenium setup above is for Linux. If you are on Windows (or Mac), you will find many examples of Selenium/chromedriver setups for those platforms among the Selenium questions on this forum.
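
As a platform-independent alternative, Selenium 4.6 and later ship with Selenium Manager, which looks up a matching driver when no chromedriver path is given. The sketch below assumes such a Selenium version and a locally installed Chrome; `make_browser` is a hypothetical helper name, not part of Selenium.

```python
def make_browser(headless: bool = True):
    """Start Chrome without an explicit chromedriver path (assumes Selenium 4.6+)."""
    # Imported here so this module can be loaded even where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        # The "new" headless mode assumes a reasonably recent Chrome.
        options.add_argument("--headless=new")
    options.add_argument("--window-size=1920,1080")
    # No Service(path) is passed, so Selenium Manager resolves the driver.
    return webdriver.Chrome(options=options)


# Usage (starts a real browser, so it is left commented out here):
# browser = make_browser()
# browser.get("https://chartviewer-europublic.bigapis.net/nzgaV/index.html")
# browser.quit()
```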

    The Selenium documentation is also helpful: https://www.selenium.dev/documentation/webdriver/getting_started/
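
Regarding the download link from the question: a `data:application/octet-stream` href carries its payload inside the URL itself, so once the href string has been read (for example with Selenium's `get_attribute("href")`), there is nothing further to download. Assuming the payload is either percent-encoded or base64-encoded text, it can be decoded as sketched below. `decode_data_uri` is a hypothetical helper, and the sample href is invented for illustration, not taken from the page.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(href: str) -> bytes:
    """Return the raw payload bytes embedded in a data: URI."""
    if not href.startswith("data:"):
        raise ValueError("not a data: URI")
    # Everything before the first comma is metadata (MIME type, charset, ;base64).
    header, _, payload = href[len("data:"):].partition(",")
    raw = unquote_to_bytes(payload)  # undo percent-encoding first
    if header.lower().endswith(";base64"):
        return base64.b64decode(raw)
    return raw


# Invented sample, standing in for the href of the download <a> tag.
sample_href = (
    "data:application/octet-stream;charset=utf-8,"
    "Categoria%2CAgo-22%0AGasolina%2C22340"
)
csv_text = decode_data_uri(sample_href).decode("utf-8")
print(csv_text)
```

The decoded text can then be passed to `pd.read_csv(StringIO(csv_text))` if the payload turns out to be CSV.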