Tags: python, html, loops, beautifulsoup, urllib

Iterate through a list of urls for web scraping with python using beautifulsoup (unknown url type)


I'm trying to scrape the content of each URL from a list that I have. There's no problem with that part; building the list works fine.

The original link is this: https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/

tags = soup('a',{'class':'js-listing-link'})
for tag in tags:
    linktag = tag.get('href').strip()
    if linktag not in linklist:
        linklist.append(linktag)

The result of the above is a list of URLs as strings. But then I try this:

for link in linklist[0]:
    page2=urllib.request.Request(link,headers={'User-Agent': 'Mozilla/5.0'})
    myhtml2 = urllib.request.urlopen(page2).read()
    soupfl = BeautifulSoup(myhtml2, 'html.parser')

just to prove that everything works, but I get an error:

raise ValueError("unknown url type: %r" % self.full_url)

ValueError: unknown url type: 'h'
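
What causes this: `linklist[0]` is a single string, so the loop iterates over its characters, and the first "link" handed to urllib is just 'h'. Iterating over linklist itself (without the index) yields the full URLs. A minimal illustration, using placeholder URLs rather than the real listing links:

linklist = ["https://example.com/a", "https://example.com/b"]

for link in linklist[0]:
    print(repr(link))   # 'h', 't', 't', 'p', 's', ':' ... characters of the first string

for link in linklist:
    print(link)         # the full URLs, one per iteration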


Solution

  • To get all the links and then open each one, you can use this example:

    import urllib.request
    from bs4 import BeautifulSoup
    
    
    URL = "https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/"
    
    HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
    }
    
    r = urllib.request.Request(URL, headers=HEADERS)
    soup = BeautifulSoup(urllib.request.urlopen(r).read(), "html.parser")
    
    tags = soup.find_all("a", {"class": "js-listing-link"})
    
    links = []
    for link in tags:
        # keep only unique hrefs, preserving order
        if link["href"] not in links:
            links.append(link["href"])
    
    for link in links:
        print("Getting:", link)
        r2 = urllib.request.Request(link, headers=HEADERS)
        soup2 = BeautifulSoup(urllib.request.urlopen(r2).read(), "html.parser")
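
  • If some listing pages fail or the site throttles rapid requests, a slightly more defensive version of the same loop may help. This is an optional variation on the example above, not part of the original answer; it reuses links and HEADERS from that example, and the one-second pause is an arbitrary choice:

    import time
    import urllib.error
    
    
    for link in links:
        print("Getting:", link)
        try:
            r2 = urllib.request.Request(link, headers=HEADERS)
            soup2 = BeautifulSoup(urllib.request.urlopen(r2).read(), "html.parser")
        except (urllib.error.HTTPError, urllib.error.URLError) as e:
            # skip listings that return an HTTP error or fail to resolve
            print("Skipping", link, "->", e)
            continue
        time.sleep(1)  # small pause between requests to be polite to the server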