Tags: python, python-3.x, web-scraping, getattr

Can't parse bs4 src attribute using the getattr() function


I've created a script to parse two fields from every movie container from a webpage. The script is doing fine.

I'm trying to use this getattr() function to scrape text and src from two fields, as in movie_name and image_link. In case of movie_name, it works. However, it fails when I try to parse image_link.

There is a function, currently commented out, that works when I uncomment it. However, my goal here is to use getattr() to parse src.

import requests
from bs4 import BeautifulSoup

url = "https://yts.am/browse-movies"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

# def get_information(url):
#     res = requests.get(url,headers=headers)
#     soup = BeautifulSoup(res.text,'lxml')
#     for row in soup.select(".browse-movie-wrap"):
#         movie_name = row.select_one("a.browse-movie-title").text
#         image_link = row.select_one("img.img-responsive").get("src")
#         yield movie_name,image_link

def get_information(url):
    res = requests.get(url,headers=headers)
    soup = BeautifulSoup(res.text,'lxml')
    for row in soup.select(".browse-movie-wrap"):
        movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)
        image_link = getattr(row.select_one("img.img-responsive"),"src",None)
        yield movie_name,image_link

if __name__ == '__main__':
    for items in get_information(url):
        print(items)

How can I scrape src using the getattr() function?


Solution

  • The reason this works:

    movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)
    

    But this doesn't:

    image_link = getattr(row.select_one("img.img-responsive"),"src",None)
    

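    is because text is a real Python attribute of a bs4 Tag object (a property that returns the element's text), so getattr() finds it. src, on the other hand, is an HTML attribute, which bs4 keeps in the tag's .attrs dictionary rather than exposing it as a Python attribute. When a Tag is asked for an unknown name such as src, bs4 treats the lookup as a search for a child tag called <src>, finds none, and returns None without raising an error. That is why the second line quietly gives you None.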
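
The difference between the two lookups can be reproduced offline on a small snippet. This is a minimal sketch: the markup and the class names title and cover are made up for illustration, and html.parser is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; it is not taken from the real page.
html = '<div><a class="title">Some Movie</a><img class="cover" src="cover.jpg"></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.select_one("a.title")
img = soup.select_one("img.cover")

# .text is a Python attribute (a property) of Tag, so getattr() finds it.
name = getattr(link, "text", None)

# "src" is not a Python attribute. bs4 treats the lookup as a search for a
# child <src> tag, finds none, and returns None without raising.
via_getattr = getattr(img, "src", None)

# HTML attributes are stored in the .attrs dict and read with .get() or [key].
via_get = img.get("src")

print(name, via_getattr, via_get)  # Some Movie None cover.jpg
```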

    If you look at attributes of:

    row.select_one("a.browse-movie-title").attrs
    

    You'll get:

    {'href': 'https://yts.mx/movies/imperial-blue-2019', 'class': ['browse-movie-title']}
    

    Likewise, for

    row.select_one(".img-responsive").attrs
    

    The output is:

    {'class': ['img-responsive'], 'src': 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg', 'alt': 'Imperial Blue (2019) download', 'width': '170', 'height': '255'}
    

    So, if we experiment and do this:

    getattr(row.select_one(".img-responsive"), "attrs", None).src
    

    We'll end up with:

    AttributeError: 'dict' object has no attribute 'src'
    

    Therefore, as mentioned in the comments, getattr() is the wrong tool here: it looks up Python attributes of the object, not the HTML attributes of the tag. To read an HTML attribute, use either the .get() method or the [key] syntax.
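
The .get() method and the [key] syntax behave differently when the attribute is absent, which may matter when choosing between them. A small sketch on a made-up img tag that has no src (the "no-image" fallback string is just an illustrative value):

```python
from bs4 import BeautifulSoup

# Made-up markup: an <img> with no src attribute.
img = BeautifulSoup('<img class="cover" alt="poster">', "html.parser").img

# .get() returns None, or the default you pass, when the attribute is missing.
missing = img.get("src")
fallback = img.get("src", "no-image")

# [key] raises KeyError for a missing attribute instead.
try:
    img["src"]
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)  # None no-image True
```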

    For example:

    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    }
    
    
    def get_information(url):
        soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
        for row in soup.select(".browse-movie-wrap"):
            movie_name = row.select_one("a.browse-movie-title").getText()
            image_link = row.select_one("img.img-responsive").get("src")
            yield movie_name, image_link
    
    
    if __name__ == '__main__':
        for items in get_information("https://yts.am/browse-movies"):
            print(items)
    

    This produces:

    ('Imperial Blue', 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg')
    ('Ablaze', 'https://img.yts.mx/assets/images/movies/ablaze_2001/medium-cover.jpg')
    ('[CN] Long feng zhi duo xing', 'https://img.yts.mx/assets/images/movies/long_feng_zhi_duo_xing_1984/medium-cover.jpg')
    ('Bobbie Jo and the Outlaw', 'https://img.yts.mx/assets/images/movies/bobbie_jo_and_the_outlaw_1976/medium-cover.jpg')
    ('Adam Resurrected', 'https://img.yts.mx/assets/images/movies/adam_resurrected_2008/medium-cover.jpg')
    ('[ZH] The Wasted Times', 'https://img.yts.mx/assets/images/movies/the_wasted_times_2016/medium-cover.jpg')
    ('Promise', 'https://img.yts.mx/assets/images/movies/promise_2021/medium-cover.jpg')
    
    and so on ...
    

    Finally, if you really want to parse this with getattr() you can try this:

    movie_name = getattr(row.select_one("a.browse-movie-title"), "getText", None)()
    image_link = getattr(row.select_one("img.img-responsive"), "attrs", None)["src"]
    

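    And you'll still get the same results, but, IMHO, this is far more convoluted and less readable than plain .getText() and .get("src") calls. Note also that the None defaults do not protect you here: if getattr() ever fell back to None, calling None() or indexing None["src"] would raise a TypeError.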
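
If the point of getattr(..., None) was to avoid a crash when select_one() finds nothing, an explicit None check is one way to get that without getattr(). The sketch below assumes that motivation; text_or_default and attr_or_default are hypothetical helpers, not part of bs4, and the markup is made up to mimic one complete container and one without an image.

```python
from bs4 import BeautifulSoup


def text_or_default(tag, default=None):
    """Return the tag's stripped text, or `default` if the tag is None.
    Hypothetical helper, not part of bs4."""
    if tag is None:
        return default
    return tag.get_text(strip=True)


def attr_or_default(tag, name, default=None):
    """Return HTML attribute `name`, or `default` if the tag or the
    attribute is missing. Hypothetical helper, not part of bs4."""
    if tag is None:
        return default
    return tag.get(name, default)


# Made-up markup: the second container has no <img>.
html = """
<div class="browse-movie-wrap">
  <a class="browse-movie-title">First</a>
  <img class="img-responsive" src="first.jpg">
</div>
<div class="browse-movie-wrap">
  <a class="browse-movie-title">Second</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    (
        text_or_default(row.select_one("a.browse-movie-title")),
        attr_or_default(row.select_one("img.img-responsive"), "src"),
    )
    for row in soup.select(".browse-movie-wrap")
]
print(rows)  # [('First', 'first.jpg'), ('Second', None)]
```

The missing image yields None instead of an exception, which is the behaviour the getattr() default appeared to be aiming for.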