Tags: python, html, web-scraping, frame

Python requests only returning empty sets when scraping


This is my first attempt at programming. I'm trying to scrape some text from a site using bs4, Selenium, etc. The site is 'http://oulim.kr'.

How can I scrape things inside the frameset?

This is what I have tried:

import urllib
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://oulim.kr/'

driver = webdriver.Chrome('./driver/chromedriver')
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html)

a = soup.select("#divAlba > table:nth-child(3) > tbody > tr:nth-child(2) > td:nth-child(5) > a > font > b")
print(a)

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://oulim.kr')
r.html.find('.tbody')

Solution

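  • Selenium treats frames as separate pages (because the browser has to load each one separately), so it doesn't search inside frames by default. For the same reason, page_source doesn't return the HTML from inside a frame while the driver is still on the top-level page.

    You have to find the <frame> element and switch to it with driver.switch_to.frame(...) before you can work with its content.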

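    from selenium.webdriver.common.by import By

    # find_elements_by_tag_name() was removed in Selenium 4; use find_elements(By...)
    frames = driver.find_elements(By.TAG_NAME, 'frame')
    driver.switch_to.frame(frames[0])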
    

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    
    url = 'http://oulim.kr/'
    selector = "#divAlba > table:nth-child(3) > tbody > tr:nth-child(2) > td:nth-child(5) > a > font > b"
    
    # Selenium 4 style: the chromedriver path is passed through a Service object
    driver = webdriver.Chrome(service=Service('./driver/chromedriver'))
    driver.get(url)
    
    # --- switch frame ---
    
    frames = driver.find_elements(By.TAG_NAME, 'frame')
    driver.switch_to.frame(frames[0])
    
    # --- CSS without BeautifulSoup ---
    
    a = driver.find_element(By.CSS_SELECTOR, selector)
    print(a.text)
    
    # --- CSS with BeautifulSoup ---
    
    # after switching, page_source returns the HTML of the current frame
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    
    a = soup.select(selector)
    print(a[0].text)
    
    driver.quit()
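
Because each frame is loaded as a separate page, the same idea also explains why a plain HTTP request to the top-level URL finds nothing: the response contains only the frameset, not the content of its frames. A minimal sketch of reading the frame addresses out of such a response is below. It uses only the standard library; the FrameSrcParser class, the frame_urls helper and the SAMPLE markup (including its file names) are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return the absolute URLs of all frames found in html."""
    parser = FrameSrcParser()
    parser.feed(html)
    # frame src values are often relative, so resolve them against the page URL
    return [urljoin(base_url, src) for src in parser.sources]


# Hypothetical frameset markup, similar in shape to what a frame-based
# site returns for its top-level page; the file names are invented.
SAMPLE = """
<html>
  <frameset rows="100%,*">
    <frame src="main/index.php" name="main">
    <frame src="/blank.html" name="hidden">
  </frameset>
</html>
"""

urls = frame_urls(SAMPLE, "http://oulim.kr/")
print(urls)
```

Each URL returned this way could then be fetched on its own (for example with requests) and parsed with BeautifulSoup. This only helps if the frame's content is present in the HTML the server sends; if the page builds its content with JavaScript, switching frames in Selenium as shown above is still the way to go.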