I've created a script to parse two fields from every movie container on a webpage. The script is doing fine.
I'm trying to use the getattr() function to scrape the text and the src from two fields, movie_name and image_link. For movie_name it works. However, it fails when I try to parse image_link.
There is a function, currently commented out, which works when I uncomment it. However, my goal here is to use getattr() to parse src.
import requests
from bs4 import BeautifulSoup

url = "https://yts.am/browse-movies"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

# def get_information(url):
#     res = requests.get(url,headers=headers)
#     soup = BeautifulSoup(res.text,'lxml')
#     for row in soup.select(".browse-movie-wrap"):
#         movie_name = row.select_one("a.browse-movie-title").text
#         image_link = row.select_one("img.img-responsive").get("src")
#         yield movie_name,image_link

def get_information(url):
    res = requests.get(url,headers=headers)
    soup = BeautifulSoup(res.text,'lxml')
    for row in soup.select(".browse-movie-wrap"):
        movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)
        image_link = getattr(row.select_one("img.img-responsive"),"src",None)
        yield movie_name,image_link

if __name__ == '__main__':
    for items in get_information(url):
        print(items)
How can I scrape src using the getattr() function?
The reason this works:
movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)
But this doesn't:
image_link = getattr(row.select_one("img.img-responsive"),"src",None)
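is that getattr() looks up Python attributes on the object. text is a genuine Python attribute of a bs4 Tag (a property that returns the tag's text), so the first call succeeds. src, on the other hand, is an HTML attribute: it is stored in the tag's attrs dictionary, not as a Python attribute or method. bs4 treats an unknown attribute name such as src as a search for a child tag with that name, finds none, and returns None.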
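To see the difference without hitting the site, here is a minimal sketch. The HTML snippet is made up for illustration (it only imitates one movie container), and it uses the built-in html.parser rather than lxml so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one movie container; not fetched from the site.
html = """
<div class="browse-movie-wrap">
  <a class="browse-movie-title" href="/movies/imperial-blue-2019">Imperial Blue</a>
  <img class="img-responsive" src="medium-cover.jpg" alt="Imperial Blue (2019)">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.select_one(".browse-movie-wrap")
title_tag = row.select_one("a.browse-movie-title")
img_tag = row.select_one("img.img-responsive")

# .text is a real Python attribute (a property) of the Tag object.
name_via_getattr = getattr(title_tag, "text", None)

# "src" is not a Python attribute of the Tag. bs4 reads img_tag.src as
# "find a child tag named <src>", which does not exist, so this is None.
src_via_getattr = getattr(img_tag, "src", None)

# HTML attributes live in the attrs dict, reachable via .get() or [key].
src_via_get = img_tag.get("src")
src_via_key = img_tag["src"]

print(name_via_getattr)
print(src_via_getattr)
print(src_via_get, src_via_key)
```

The None printed on the second line is the same value the original script ends up with for image_link.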
If you look at the attributes of:
row.select_one("a.browse-movie-title").attrs
You'll get:
{'href': 'https://yts.mx/movies/imperial-blue-2019', 'class': ['browse-movie-title']}
Likewise, for
row.select_one(".img-responsive").attrs
The output is:
{'class': ['img-responsive'], 'src': 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg', 'alt': 'Imperial Blue (2019) download', 'width': '170', 'height': '255'}
So, if we experiment and do this:
getattr(row.select_one(".img-responsive"), "attrs", None).src
We'll end up with:
AttributeError: 'dict' object has no attribute 'src'
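The error above is ordinary dict behaviour and has nothing to do with bs4 itself. As a rough illustration, the sketch below uses a plain dict standing in for .attrs (the values are shortened and made up) to contrast attribute lookup with key lookup.

```python
# A plain dict standing in for the .attrs mapping of the <img> tag
# (hypothetical, shortened values).
attrs = {"class": ["img-responsive"], "src": "medium-cover.jpg"}

# Attribute lookup: a dict has no Python attribute named "src",
# so getattr() falls back to the default.
via_getattr = getattr(attrs, "src", None)

# Key lookup: this is where the value actually lives.
via_get = attrs.get("src")          # returns None if the key is missing
via_key = attrs["src"]              # raises KeyError if the key is missing
missing = attrs.get("data-src")     # key not present

# Direct attribute access reproduces the error shown above.
try:
    attrs.src
except AttributeError as exc:
    error_message = str(exc)

print(via_getattr, via_get, via_key, missing)
print(error_message)
```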
Therefore, as mentioned in the comments, this is not how you'd use getattr() in the pure Python sense on bs4 objects. To read an HTML attribute, you can use either the .get() method or the [key] syntax.
For example:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_information(url):
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
    for row in soup.select(".browse-movie-wrap"):
        movie_name = row.select_one("a.browse-movie-title").getText()
        image_link = row.select_one("img.img-responsive").get("src")
        yield movie_name, image_link

if __name__ == '__main__':
    for items in get_information("https://yts.am/browse-movies"):
        print(items)
This produces:
('Imperial Blue', 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg')
('Ablaze', 'https://img.yts.mx/assets/images/movies/ablaze_2001/medium-cover.jpg')
('[CN] Long feng zhi duo xing', 'https://img.yts.mx/assets/images/movies/long_feng_zhi_duo_xing_1984/medium-cover.jpg')
('Bobbie Jo and the Outlaw', 'https://img.yts.mx/assets/images/movies/bobbie_jo_and_the_outlaw_1976/medium-cover.jpg')
('Adam Resurrected', 'https://img.yts.mx/assets/images/movies/adam_resurrected_2008/medium-cover.jpg')
('[ZH] The Wasted Times', 'https://img.yts.mx/assets/images/movies/the_wasted_times_2016/medium-cover.jpg')
('Promise', 'https://img.yts.mx/assets/images/movies/promise_2021/medium-cover.jpg')
and so on ...
Finally, if you really want to parse this with getattr(), you can try this:
movie_name = getattr(row.select_one("a.browse-movie-title"), "getText", None)()
image_link = getattr(row.select_one("img.img-responsive"), "attrs", None)["src"]
You'll still get the same results but, IMHO, this is far more complicated and less readable than the plain .getText() and .get("src") syntax.
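One thing getattr(..., None) does give you is a None instead of an exception when select_one() finds nothing. If that was the motivation, a small helper can keep that behaviour while still using .get(). This is only a sketch: text_or_none, attr_or_none and parse_rows are hypothetical names, and the sample HTML is made up so the example runs offline with the built-in html.parser.

```python
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the tag's stripped text, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def attr_or_none(tag, name):
    """Return an HTML attribute of the tag, or None if tag or attribute is missing."""
    return tag.get(name) if tag is not None else None

def parse_rows(html):
    """Yield (movie_name, image_link) pairs, tolerating missing elements."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.select(".browse-movie-wrap"):
        movie_name = text_or_none(row.select_one("a.browse-movie-title"))
        image_link = attr_or_none(row.select_one("img.img-responsive"), "src")
        yield movie_name, image_link

# Hypothetical markup: the second container deliberately has no <img>.
sample_html = """
<div class="browse-movie-wrap">
  <a class="browse-movie-title">Imperial Blue</a>
  <img class="img-responsive" src="imperial_blue.jpg">
</div>
<div class="browse-movie-wrap">
  <a class="browse-movie-title">Ablaze</a>
</div>
"""

results = list(parse_rows(sample_html))
for item in results:
    print(item)
```

With the real page you would pass requests.get(url, headers=headers).text to parse_rows instead of sample_html.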