I found some really nice web-scraping code on https://towardsdatascience.com/ and I'm trying to adapt it for my own use.
https://ingatlan.com/lista/elado+lakas+ii-ker?page=1 is a Hungarian real estate website. For now I just want to grab the property prices, but when I run my code I don't get any results: the number of items found is 0.
import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

pagesToGet= 1
upperframe=[]
for page in range(1,pagesToGet+1):
    print('processing page :', page)
    url = 'https://ingatlan.com/lista/elado+lakas+ii-ker?page='+str(page)
    print(url)
    try:
        page=requests.get(url)
    except Exception as e:
        error_type, error_obj, error_info = sys.exc_info()
        print ('ERROR FOR LINK:',url)
        print (error_type, 'Line:', error_info.tb_lineno)
        continue
    time.sleep(2)
    soup=BeautifulSoup(page.text,'html.parser')
    frame=[]
    links=soup.find_all('div',attrs={'class':'listing js-listing '})
    print(len(links))
    filename="NEWS.csv"
    f=open(filename,"w", encoding = 'utf-8')
    headers="Price\n"
    f.write(headers)
    for j in links:
        Price = j.find("div",attrs={'class':'price'})
        frame.append((Price))
    upperframe.extend(frame)
f.close()
data=pd.DataFrame(upperframe, columns=['Price'])
data.head()
What am I doing wrong? The same approach works on some sites, such as Myprotein, but not on others.
Only the price is extracted here, since that is all you asked for.
Without a User-Agent header, the request is rejected with a 403 Forbidden error.
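To check the status code separately from the parsing, a minimal sketch along these lines may help. It uses the standard library's `urllib.request` (which the question already imports) rather than `requests`, and the helper names `build_request`, `fetch_status` and `compare_user_agents` are hypothetical. Note that when no User-Agent is supplied, the library still sends its own default one, so the comparison is really "library default" versus "custom value".

```python
import urllib.error
import urllib.request

URL = "https://ingatlan.com/lista/elado+lakas+ii-ker?page=1"


def build_request(url, user_agent=None):
    """Return a Request, optionally carrying an explicit User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=10):
    """Return the HTTP status code; an HTTPError such as 403 is returned, not raised."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def compare_user_agents(url, user_agent="XYZ/3.0"):
    """Fetch the URL twice: with urllib's default User-Agent and with a custom one."""
    return {
        "default": fetch_status(build_request(url)),
        "custom": fetch_status(build_request(url, user_agent)),
    }
```

Calling `compare_user_agents(URL)` performs two real requests, so the result depends on how the site behaves at that moment. If the two status codes differ, the header is the likely culprit; the scraping code itself follows.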
import requests
from bs4 import BeautifulSoup
import pandas as pd

start_url="https://ingatlan.com/lista/elado+lakas+ii-ker?page=1"
page_data=requests.get(start_url, headers={'User-Agent': 'XYZ/3.0'})
soup=BeautifulSoup(page_data.content,"html.parser")
# for i in soup:   # at first I was just checking the HTTP status here
#     print(i)     # without a User-Agent I got 403 as the response
#     print()
Price=[]
for job_tag in soup.find_all("div",class_="resultspage__content"):
    for job_tag2 in job_tag.find_all("div",class_="listing js-listing"):
        for job_tag3 in job_tag2.find_all("div",class_="price__container js-has-sqm-price-info-tooltip"):
            price=job_tag3.find("div",class_="price")
            Price.append(price.text.strip())
# print(Price)
data=pd.DataFrame(Price,columns=["price"])
print(data)
Output of the pandas DataFrame:
         price
0    31.5 M Ft
1    77.9 M Ft
2      62 M Ft
3   129.5 M Ft
4     125 M Ft
5    95.9 M Ft
6    46.9 M Ft
7    45.9 M Ft
8    59.9 M Ft
9     109 M Ft
10     48 M Ft
11     87 M Ft
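The prices above are still strings. If numeric values are needed, something like the following sketch could convert them. It assumes every price has the form shown in the output ("31.5 M Ft", read here as millions of forints); the helper name `price_to_million_huf` is hypothetical, and other formats on the site would need their own handling.

```python
import re

# Assumption: prices look like "31.5 M Ft", as in the output above.
# A decimal comma is accepted as well as a decimal point.
_PRICE_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*M\s*Ft\s*$")


def price_to_million_huf(text):
    """Convert e.g. '31.5 M Ft' to 31.5; return None if the text does not match."""
    match = _PRICE_RE.match(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


prices = ["31.5 M Ft", "77.9 M Ft", "62 M Ft"]
print([price_to_million_huf(p) for p in prices])  # [31.5, 77.9, 62.0]
```

With the DataFrame from the code above, `data["price"].map(price_to_million_huf)` would apply the same conversion to the whole column.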