I've created a script using the `concurrent.futures` library to print the results from the `fetch_links` function. When I use a `print` statement inside the function, I get the results as expected. What I'd like to do now is print the results from that function using a `yield` statement instead.

Is there any way I can modify the code under the `__main__` block to print the results from `fetch_links` while keeping the function as is, that is, keeping the `yield` statement?

import requests
from bs4 import BeautifulSoup
import concurrent.futures as cf

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=4&pagesize=50"
]

base = 'https://stackoverflow.com{}'

def fetch_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select(".summary .question-hyperlink"):
        # print(base.format(item.get("href")))
        yield base.format(item.get("href"))

if __name__ == '__main__':
    with requests.Session() as s:
        with cf.ThreadPoolExecutor(max_workers=5) as exe:
            future_to_url = {exe.submit(fetch_links,s,url): url for url in links}
            cf.as_completed(future_to_url)

Your `fetch_links` is a generator function, so the result of each future is a generator object. You have to loop over that too in order to get the results:

import requests
from bs4 import BeautifulSoup
import concurrent.futures as cf

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=4&pagesize=50"
]

base = 'https://stackoverflow.com{}'

def fetch_links(s, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select(".summary .question-hyperlink"):
        yield base.format(item.get("href"))

if __name__ == '__main__':
    with requests.Session() as s:
        with cf.ThreadPoolExecutor(max_workers=5) as exe:
            future_to_url = {exe.submit(fetch_links, s, url): url for url in links}
            # Outer loop: iterate the futures as they complete.
            for future in cf.as_completed(future_to_url):
                # Inner loop: future.result() is the generator returned by
                # fetch_links, so iterate it to get the individual links.
                for result in future.result():
                    print(result)

Output:
https://stackoverflow.com/questions/64298886/rvest-webscraping-in-r-with-form-inputs
https://stackoverflow.com/questions/64298879/is-this-site-not-suited-for-web-scraping-using-beautifulsoup
https://stackoverflow.com/questions/64297907/python-3-extract-html-data-from-sports-site
https://stackoverflow.com/questions/64297728/cant-get-the-fully-loaded-html-for-a-page-using-puppeteer
https://stackoverflow.com/questions/64296859/scrape-text-from-a-span-tag-containing-nested-span-tag-in-beautifulsoup
https://stackoverflow.com/questions/64296656/scrapy-nameerror-name-items-is-not-defined
https://stackoverflow.com/questions/64296201/missing-values-while-scraping-using-beautifulsoup-in-python
https://stackoverflow.com/questions/64296130/how-can-i-identify-the-element-containing-the-link-to-my-linkedin-profile-after
https://stackoverflow.com/questions/64295959/why-use-scrapy-or-beautifulsoup-vs-just-parsing-html-with-regex-v2
https://stackoverflow.com/questions/64295842/how-to-retreive-scrapping-data-from-web-to-json-like-format
https://stackoverflow.com/questions/64295559/how-to-iterate-through-a-supermarket-website-and-getting-the-product-name-and-pr
https://stackoverflow.com/questions/64295509/cant-stop-asyncio-request-for-some-delay
https://stackoverflow.com/questions/64295244/paginate-with-network-requests-scraper
and so on ...
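
One thing worth knowing about the approach above: calling a generator function only creates a generator object, so the body of `fetch_links` most likely does not run in the pool's worker threads but in whichever thread iterates `future.result()`. The following is a minimal sketch of that behaviour that needs no network access. The names `numbers`, `collect` and `run_demo` are hypothetical and not part of the original script; `collect` shows one possible way to do the work inside the worker, by draining the generator into a list there.

```python
import concurrent.futures as cf
import threading


def numbers(n):
    # Generator body: remembers the thread that actually executes it.
    thread_name = threading.current_thread().name
    for i in range(n):
        yield thread_name, i


def collect(n):
    # Hypothetical helper: drains the generator inside the worker thread
    # and returns a plain list instead of a lazy generator.
    return list(numbers(n))


def run_demo():
    main_name = threading.current_thread().name
    with cf.ThreadPoolExecutor(max_workers=2, thread_name_prefix="worker") as exe:
        lazy = exe.submit(numbers, 3)     # future holds an unstarted generator
        eager = exe.submit(collect, 3)    # future holds a finished list
        lazy_items = list(lazy.result())  # generator body runs here, in the caller
        eager_items = eager.result()
    return main_name, lazy_items, eager_items


if __name__ == "__main__":
    main_name, lazy_items, eager_items = run_demo()
    print("main thread:", main_name)
    print("lazy :", lazy_items)
    print("eager:", eager_items)
```

If the requests are meant to run concurrently, the same idea could be applied to the original script by submitting a small wrapper that returns `list(fetch_links(s, url))`, leaving `fetch_links` and its `yield` untouched.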