Tags: python, html, python-3.x, web-scraping, beautifulsoup

Is there a way to extract CSS from a webpage using BeautifulSoup?


I am working on a project that requires me to view a webpage, but to use the HTML further I need to see the page in full, not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?

Here is my code:

import requests
from bs4 import BeautifulSoup


def get_html(url, name):
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text


link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))

WARNING: The page https://www.labirint.ru/books/255282/ redirects to https://www.labirint.ru/books/733371/.


Solution

  • If your goal is truly to parse the CSS:

    requests downloads the entire page, and BeautifulSoup parses all of it, including the head, inline styles, scripts, and the tags that link to external CSS and JS files. I have used the method from the Python Code article before, and I retested it against the link you provided.

    import requests
    from bs4 import BeautifulSoup as bs
    from urllib.parse import urljoin
    
    # URL of the web page you want to extract
    url = "ENTER YOUR LINK HERE"
    
    # initialize a session & set User-Agent as a regular browser
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    
    # get the HTML content
    html = session.get(url).content
    
    # parse HTML using beautiful soup
    soup = bs(html, "html.parser")
    print(soup)
    

    The soup output is very long, so I will not paste it here, but you can see that it is the complete page. Just make sure to paste in your own link.

    If you then want to pick out all the CSS URLs from the result, you can add the following (I am still using parts of the code from the well-described Python Code article mentioned above):

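    # get the CSS files
    css_files = []
    for css in soup.find_all("link"):
        # keep only <link> tags that are stylesheets and have an 'href' attribute
        # (other <link> tags point to icons, canonical URLs, and so on)
        if "stylesheet" in (css.attrs.get("rel") or []) and css.attrs.get("href"):
            # resolve relative hrefs against the page URL
            css_url = urljoin(url, css.attrs.get("href"))
            css_files.append(css_url)
    print(css_files)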
    

    The resulting css_files list holds the absolute URLs of the linked CSS files. You can then fetch each one separately to see the styles being imported.
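
    Fetching those URLs can be scripted as well. The following is a minimal sketch rather than part of the article's code: the function name fetch_stylesheets and its session and timeout parameters are hypothetical, and it assumes the input is a list of absolute stylesheet URLs like css_files.

```python
import requests


def fetch_stylesheets(css_urls, session=None, timeout=10):
    """Download each stylesheet URL and return a dict of {url: css_text}.

    URLs that cannot be downloaded are reported and skipped.
    """
    # Reuse the caller's session (and its headers) if one is given
    session = session or requests.Session()
    stylesheets = {}
    for css_url in css_urls:
        try:
            response = session.get(css_url, timeout=timeout)
            # Treat 4xx/5xx responses as failures
            response.raise_for_status()
        except requests.RequestException as error:
            print(f"Skipping {css_url}: {error}")
            continue
        stylesheets[css_url] = response.text
    return stylesheets
```

    Passing the same session that downloaded the page, for example fetch_stylesheets(css_files, session=session), would keep the User-Agent header consistent across requests.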

    NOTE: this particular site mixes inline styles with its HTML, i.e. the style properties are not always set through CSS files; sometimes they sit inside the HTML content itself.
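
    Styles that live inside the HTML can be collected with BeautifulSoup too. This is a rough sketch under stated assumptions: the helper name extract_inline_css and the sample markup are made up for illustration, and on the real page you would pass in the downloaded html instead.

```python
from bs4 import BeautifulSoup


def extract_inline_css(html):
    """Return (embedded_css, inline_css) found directly in the HTML.

    embedded_css: contents of every non-empty <style> block
    inline_css:   (tag name, style attribute value) pairs
    """
    soup = BeautifulSoup(html, "html.parser")
    # CSS written inside <style>...</style> blocks
    embedded_css = [
        style.string for style in soup.find_all("style") if style.string
    ]
    # CSS written in style="..." attributes on individual tags
    inline_css = [(tag.name, tag["style"]) for tag in soup.find_all(style=True)]
    return embedded_css, inline_css


# Hypothetical sample markup standing in for the downloaded page
sample_html = """
<html>
  <head><style>body { margin: 0; }</style></head>
  <body>
    <p style="color: red;">Hello</p>
    <div>No inline style here</div>
  </body>
</html>
"""

embedded_css, inline_css = extract_inline_css(sample_html)
print(embedded_css)
print(inline_css)
```

    Together with the linked files in css_files, these two lists cover the three places a page can keep its CSS: external files, style blocks, and style attributes.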

    This should get you started.