I've written a scraper in Python to collect movie names from YIFY Torrents. The search results span around 12 pages. If I run my crawler using a print statement, it gives me the results from all the pages. However, when I run the same code using return, it gives me the content from the first page only and does not go on to the next pages. I'm having a hard time understanding the behavior of the return statement, so I'd be grateful if somebody could point out where I'm going wrong and suggest a workaround. Thanks in advance.
This is what I'm trying with (the full code):
import requests
from urllib.request import urljoin
from lxml.html import fromstring

main_link = "https://www.yify-torrent.org/search/western/"

# film_storage = [] #I tried like this as well (keeping the list storage outside the function)

def get_links(link):
    root = fromstring(requests.get(link).text)
    film_storage = []
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        film_storage.append(name)
    return film_storage

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)

if __name__ == '__main__':
    items = get_links(main_link)
    for item in items:
        print(item)
But when I do it like below, I get all the results (only the relevant portion is pasted):
def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        print(name) ## using print i get all the results from all the pages

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)
Your return statement terminates your get_links() function prematurely, which means this part

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)

is never executed.
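
To see the effect in isolation, here is a minimal sketch. The function names and the sample data are made up for illustration and stand in for your scraper and two result pages; the only point is that nothing after a return runs.

```python
def names_with_early_return(pages):
    """Mimics the question's layout: return sits before the paging step."""
    collected = list(pages[0])
    return collected
    # Unreachable: Python has already left the function on the line above.
    for page in pages[1:]:
        collected.extend(page)


def names_with_late_return(pages):
    """Same steps, but return is the last statement."""
    collected = list(pages[0])
    for page in pages[1:]:
        collected.extend(page)
    return collected


# Made-up data standing in for two result pages.
pages = [["Film A", "Film B"], ["Film C"]]
print(names_with_early_return(pages))  # ['Film A', 'Film B']
print(names_with_late_return(pages))   # ['Film A', 'Film B', 'Film C']
```

print does not have this problem because it does not leave the function, so the paging code after the loop still gets its turn.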
A quick fix would be to move the return statement to the end of your function, but then you also have to make film_storage global (defined outside the get_links() function). Otherwise every recursive call builds its own local list, and the result of that call is thrown away.
Edit: Just realized that since you will be making film_storage global, there is no need for the return statement at all.
Your code in main would just look like this:
get_links(main_link)
for item in film_storage:
    print(item)
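
If you would rather not rely on a global list, another option is to keep the return at the end and add on whatever the recursive call returns. This is only a rough sketch: FAKE_SITE and parse_page are hypothetical stand-ins for the requests/lxml part of your function, so the example runs without network access. In your real code, parse_page would do the fetching and the cssselect calls and hand back the names plus the next link.

```python
# Made-up stand-in for the site: page key -> (names on that page, next page or None).
FAKE_SITE = {
    "page1": (["Film A", "Film B"], "page2"),
    "page2": (["Film C"], "page3"),
    "page3": (["Film D"], None),
}


def parse_page(link):
    """Hypothetical helper replacing the requests + lxml code in get_links()."""
    return FAKE_SITE[link]


def get_links(link):
    names, next_page = parse_page(link)
    film_storage = list(names)  # local list, one per call
    if next_page:
        # Keep what the recursive call returns instead of discarding it.
        film_storage.extend(get_links(next_page))
    return film_storage  # last statement, so the paging above always runs


if __name__ == '__main__':
    for item in get_links("page1"):
        print(item)
```

This way each call returns its own page plus everything after it, and the caller in main can keep using items = get_links(main_link) as in the question.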