I want to create a DataFrame from a BeautifulSoup object:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import re
# Fetch the web page
url = 'https://carbondale.craigslist.org/search/apa#search=1~gallery~0~0'
response = get(url)  # link excludes posts with no pictures
page = response.text
# Parse the HTML content
soup = BeautifulSoup(page, 'html.parser')
# Information I need
list_url = []
title = []
location = []
price = []
# I run the following
list_url = [a['href'] for a in soup.select('a[href^="https"]')]
title = [x.text for x in soup.find_all(class_="title")]
location = [x.text for x in soup.find_all(class_="location")]
price = [x.text for x in soup.find_all(class_="price")]
The problem I am facing is that for some classes (e.g., title or location) some elements are missing, so the lists do not all have the same length (this can be checked with len()). As a result, creating the DataFrame raises an error. What I actually want is to include the word "None" in the column wherever an element is missing.
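For illustration, the failure can be reproduced without scraping. This is a minimal sketch with made-up lists of unequal length (the values are hypothetical, not data from the page):

```python
import pandas as pd

# Hypothetical data: three titles but only two locations
title = ["Listing A", "Listing B", "Listing C"]
location = ["Carbondale", "Marion"]

# The lengths differ, which is what breaks DataFrame construction
print(len(title), len(location))

error = None
try:
    df = pd.DataFrame({"Title": title, "Location": location})
except ValueError as exc:
    # pandas refuses to build a frame from columns of different lengths
    error = exc
    print("DataFrame construction failed:", exc)
```

Padding the shorter list would not be enough on its own, because there is no way to tell which listing the missing location belongs to.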
You need to iterate over each listing on the page and append its values one by one to the lists (list_url, title, location and price). If any of these values is missing, append None to the corresponding list. Then you can create the DataFrame from the lists.
To iterate over the listings, I looked at how the rows were structured and noticed that each one is a li element with class="cl-static-search-result". You can iterate over these elements and look up the required values inside each one, instead of calling find_all on the whole page, which does not take into account the relationship between items within a listing.
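The per-listing idea can be checked offline first. The following is a minimal sketch that assumes a hand-written HTML fragment loosely modelled on the structure described above; the markup, the example.org URLs and the helper text_or_none are made up for illustration and are not taken from the real page:

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real Craigslist page).
# The second listing deliberately has no location.
html = """
<ol>
  <li class="cl-static-search-result">
    <a href="https://example.org/listing-1">
      <div class="title">Two bedroom apartment</div>
      <div class="price">$900</div>
      <div class="location">Johnston City</div>
    </a>
  </li>
  <li class="cl-static-search-result">
    <a href="https://example.org/listing-2">
      <div class="title">Four bedroom house</div>
      <div class="price">$1,561</div>
    </a>
  </li>
</ol>
"""


def text_or_none(parent, css_class):
    """Return the stripped text of the first matching div, or None if absent."""
    tag = parent.find("div", class_=css_class)
    return tag.get_text(strip=True) if tag else None


soup = BeautifulSoup(html, "html.parser")
rows = []
for listing in soup.find_all("li", class_="cl-static-search-result"):
    link = listing.find("a")
    # One dict per listing keeps the four values tied to the same row
    rows.append({
        "URL": link.get("href") if link else None,
        "Title": text_or_none(listing, "title"),
        "Location": text_or_none(listing, "location"),
        "Price": text_or_none(listing, "price"),
    })

for row in rows:
    print(row)
```

Because every lookup is scoped to a single li element, a missing field yields None for that row only, and the remaining values stay aligned with their listing.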
Try this:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
# Fetch the web page
url = 'https://carbondale.craigslist.org/search/apa#search=1~gallery~0~0'
response = get(url)  # link excludes posts with no pictures
page = response.text
# Parse the HTML content
soup = BeautifulSoup(page, 'html.parser')
# Start from empty lists so each listing adds exactly one entry to every list
list_url, title, location, price = [], [], [], []
# Extract listings from the page
listings = soup.find_all('li', class_='cl-static-search-result')
for listing in listings:
    # Extract URL (named link so the page url above is not overwritten)
    link = listing.find('a')
    list_url.append(link['href'] if link else None)
    # Extract Title
    title_text = listing.find('div', class_='title')
    title.append(title_text.text if title_text else None)
    # Extract Location
    location_text = listing.find('div', class_='location')
    location.append(location_text.text.strip() if location_text else None)
    # Extract Price
    price_text = listing.find('div', class_='price')
    price.append(price_text.text if price_text else None)
# Create DataFrame
df = pd.DataFrame({
    'URL': list_url,
    'Title': title,
    'Location': location,
    'Price': price
})
Printing the first 5 rows gives:
   URL                                                Title                                              Location        Price
0  https://carbondale.craigslist.org/apa/d/johnst...  Almost New, 2 BR APT for Rent in JC                Johnston City   $900
1  https://carbondale.craigslist.org/apa/d/northb...  Enjoy 2 Bed/2 Bath/2 Car Ranch Home With Great...  Northbrook, IL  $945
2  https://carbondale.craigslist.org/apa/d/mount-...  Love where you live! Beautiful senior communit...  Mount Vernon    $848
3  https://carbondale.craigslist.org/apa/d/mount-...  Your dream 1 bed, 1 bath is closer than you th...  Mount Vernon    $848
4  https://carbondale.craigslist.org/apa/d/mount-...  Be at the center of it all: 1 BR, 1 BA, 553 Sq...  Mount Vernon    $848
Since we append None whenever a value is unavailable or cannot be found, rows with missing values look like this:
    URL                                                Title                                              Location  Price
16  https://carbondale.craigslist.org/apa/d/marion...  Cfd housing Houses for rent or purchase southe...  None      $500
17  https://carbondale.craigslist.org/apa/d/herrin...  4 Bedroom/2 bathroom House                         None      $1,561
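If the literal word "None" is wanted in the column rather than a missing value, one option is to fill the gaps after the DataFrame is built. This is a minimal sketch on a small made-up frame (the rows are hypothetical and only mimic the shape of the output above):

```python
import pandas as pd

# Hypothetical rows shaped like the scraped output; one location is missing
df = pd.DataFrame({
    "Title": ["Listing A", "Listing B"],
    "Location": ["Mount Vernon", None],
    "Price": ["$848", "$500"],
})

# Replace every missing value with the string "None"
df_filled = df.fillna("None")
print(df_filled)
```

Keeping real missing values in df and filling only a copy means isna() and dropna() still work on the original. Alternatively, the string "None" could be appended directly in the loop in place of None.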