Tags: python, python-3.x, web-scraping, return

Scraper collecting the content of first page only


I've written a scraper in Python to collect movie names from YIFY Torrents. The search results span around 12 pages. If I run my crawler with a print statement, it gives me the results from all the pages. However, when I use return instead, it gives me the content of the first page only and does not go on to the next pages to process the rest. I'm having a hard time understanding the behavior of the return statement, so I would be very happy if somebody could point out where I'm going wrong and suggest a workaround. Thanks in advance.

This is what I'm trying with (the full code):

import requests
from urllib.parse import urljoin
from lxml.html import fromstring            

main_link = "https://www.yify-torrent.org/search/western/"

# film_storage = [] #I tried like this as well (keeping the list storage outside the function)

def get_links(link):
    root = fromstring(requests.get(link).text)
    film_storage = []
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        film_storage.append(name)
    return film_storage

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)

if __name__ == '__main__':  
    items = get_links(main_link)
    for item in items:
        print(item)

But when I do it as below, I get all the results (only the relevant portion pasted):

def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        print(name)            # using print, I get all the results from all the pages

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)

Solution

  • Your return statement ends the get_links() call as soon as it is reached, which means this part

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)

    is never executed.

    A quick fix is to move the return statement to the end of the function, but you also have to make film_storage global (defined outside the get_links() function). Otherwise every recursive call creates its own film_storage, and the list it returns is discarded.

    Edit: Just realized that, since film_storage will be global, there is no need for the return statement at all.

    Your code in main would just look like this:

    get_links(main_link)
    for item in film_storage:
        print(item)
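
If you would rather not rely on a global list, another option is to keep the return as the last statement and collect what each recursive call returns. Below is a minimal sketch of that idea, not a drop-in replacement: fetch_page() and FAKE_SITE are hypothetical stand-ins for the requests/lxml calls in the question, so the example runs without touching the real site.

```python
# Hypothetical stand-in for the real site: each "page" maps to the film
# names found on it and the link to the next page (None on the last page).
FAKE_SITE = {
    "/page/1": (["Film A", "Film B"], "/page/2"),
    "/page/2": (["Film C"], "/page/3"),
    "/page/3": (["Film D", "Film E"], None),
}


def fetch_page(link):
    """Hypothetical helper: return (names, next_link) for one page.

    In the real scraper this is where requests.get() and the
    cssselect() lookups from the question would go.
    """
    return FAKE_SITE[link]


def get_links(link):
    names, next_link = fetch_page(link)
    film_storage = list(names)  # local list, one per call
    if next_link:
        # Keep what the recursive call returns instead of discarding it.
        film_storage.extend(get_links(next_link))
    return film_storage  # last statement, so nothing after it is skipped


if __name__ == "__main__":
    for item in get_links("/page/1"):
        print(item)
```

Here each call returns its own page's names plus everything gathered from the pages after it, so the caller in main gets one combined list and no global is needed.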