I'm scraping some information, and below is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.privateproperty.com.ng/property-for-sale"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', class_="similar-listings-item sponsored-listing")
for result in results:
    Title = result.find('div', class_= "similar-listings-info").text.replace('\n','')
    location = result.find( class_= "listings-location").text.replace('\n','')
    Price = result.find('div', class_= "similar-listings-price").text.replace('\n','')
info = (Title, location, Price)
print(info)
Why does this line
results = soup.find_all('div', class_="similar-listings-item sponsored-listing")
return only the 1st element?
I'm getting 2 elements, but maybe you're only seeing the last result because the info=... and print(info) lines are after the loop instead of inside it. Indent them to print every result from inside the loop.
If your issue is that you want all the listings, you should note that only the sponsored listings have the sponsored-listing class. To get all the listings, you can try using
results = soup.find_all('div', {'class': "similar-listings-item"}) ## OR
# results = soup.select('div.similar-listings-item')
[Use soup.select('div.similar-listings-item:not(.sponsored-listing)')
if you only want unsponsored listings. Check out how to use .select
with CSS selectors for more details.]
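As a rough illustration of how these selections differ, here is a small sketch run against a made-up HTML fragment. The markup is an assumption that only mimics the class names mentioned above; it is not the site's real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the question
html = """
<div class="similar-listings-item sponsored-listing">A</div>
<div class="similar-listings-item sponsored-listing">B</div>
<div class="similar-listings-item">C</div>
<div class="other">D</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# every listing, sponsored or not
all_items = soup.select('div.similar-listings-item')
# only listings that carry BOTH classes
sponsored = soup.select('div.similar-listings-item.sponsored-listing')
# listings without the sponsored-listing class
unsponsored = soup.select('div.similar-listings-item:not(.sponsored-listing)')

print(len(all_items), len(sponsored), len(unsponsored))  # 3 2 1
```

On the live page the counts will depend on whatever the server returns at the time.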
I want to extract list of lists from the (variable)
Which variable? If you want a list of all the Title, location, and Price values for each result, initialize an empty list [like infoList] before the loop, then indent info=... so that it is inside the loop, and append info to infoList at the end of the loop (but still in the loop). Something like
infoList = []
for result in results:
    Title = result.find('div', class_= "similar-listings-info").text.replace('\n','')
    location = result.find( class_= "listings-location").text.replace('\n','')
    Price = result.find('div', class_= "similar-listings-price").text.replace('\n','')
    info = (Title, location, Price) # this is a tuple btw, so
    # infoList.append(info) # --> list of tuples
    infoList.append([Title, location, Price]) # --> list of lists
    # print(info) # will print for every result
print(info) # will print ONLY the LAST result
Btw, it's not very safe to chain .find and .text like that. If .find doesn't find anything, it returns None, and an AttributeError will be raised when trying to get .text. To be more cautious, you should check that find returned something first.
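A minimal sketch of that check, assuming a made-up fragment in which the price element is missing (the markup and the title_tag/price_tag names are only for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical listing with a title but no price element
html = ('<div class="similar-listings-item">'
        '<div class="similar-listings-info">\nNice flat\n</div></div>')
result = BeautifulSoup(html, 'html.parser').find('div', class_='similar-listings-item')

# keep the tag first, and only read .text if something was found
title_tag = result.find('div', class_='similar-listings-info')
title = title_tag.text.replace('\n', '') if title_tag is not None else None

price_tag = result.find('div', class_='similar-listings-price')  # not in the fragment
price = price_tag.text.replace('\n', '') if price_tag is not None else None

print(title, price)  # Nice flat None
```

Without the check, the price line would fail as soon as one listing lacks that element.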
You could use my selectForList function like infoList = [selectForList(result, ['div.similar-listings-info', 'p.listings-location', 'div.similar-listings-price']) for result in results] or [since you want to remove the \ns and also if you don't want to use CSS selectors] use a variation of it:
def get_min_text(containerTag, elName, classAttr, defaultVal=None):
    el = containerTag.find(elName, class_=classAttr)
    if el is None: return defaultVal
    return ' '.join(el.get_text(' ').split()) # split+join minimizes whitespace

results = soup.find_all('div', {'class': "similar-listings-item"})
infoList = [[get_min_text(result, *c[:3]) for c in [
    ('div', 'similar-listings-info'), # Title
    ('p', 'listings-location'), # Location
    ('div', 'similar-listings-price') # Price
]] for result in results]