I have been working on a project where I want to gather the urls and then I could just import all the modules with the scraper classes and it should register all of them into the list.
I have currently done:
import sys
import tldextract
class Scraper:
scrapers = {}
def __init_subclass__(scraper_class):
Scraper.scrapers[scraper_class.url] = scraper_class # .url -> Unresolved attribute reference 'url' for class 'Scraper'
@classmethod
def for_url(cls, url):
k = tldextract.extract(url)
return scrapers[k.domain]() #<-- Unresolved reference 'scrapers'
class BBCScraper(Scraper):
url = 'bbc.co.uk'
def scrape(s):
print(s)
# FIXME Scrape the correct values for BBC
return "Yay works!"
url = 'https://www.bbc.co.uk/'
scraper = Scraper.for_url(url)
scraper.scrape("yay")
My currently problem right now is that I am not able to continue to execute the code as I am not able to return scrapers[k.domain]()
Output >>> NameError: name 'scrapers' is not defined
I wonder how I can pick up the correct class as for exaple if the URL is the bbc, it should g into the BBCScraper class and then we call the scrape which later on will return the values that has been scraped on that specific website
Do as you did in __init_subclass__
or use cls.scrapers
.
@classmethod
def for_url(cls, url):
k = tldextract.extract(url)
return Scraper.scrapers[k.domain]()
# or
return cls.scrapers[k.domain]()
As for the second issue