python, pandas, beautifulsoup, python-requests, urllib

Cannot find HTML elements with BeautifulSoup Python


I found some really nice code on https://towardsdatascience.com/ for web scraping, and I'm trying to adapt it for my own use.

https://ingatlan.com/lista/elado+lakas+ii-ker?page=1 is a Hungarian real-estate website. For a start I just want to grab the prices of the listings, but when I run my code I don't get any results: the number of items found is 0.

import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

pagesToGet= 1

upperframe=[]  
for page in range(1,pagesToGet+1):
    print('processing page :', page)
    url = 'https://ingatlan.com/lista/elado+lakas+ii-ker?page='+str(page)
    print(url)
    
    
    try:
        page=requests.get(url)                            
    
    except Exception as e:                                   
        error_type, error_obj, error_info = sys.exc_info()     
        print ('ERROR FOR LINK:',url)                          
        print (error_type, 'Line:', error_info.tb_lineno)     
        continue                                              
    time.sleep(2)   
    soup=BeautifulSoup(page.text,'html.parser')
    frame=[]
    links=soup.find_all('div',attrs={'class':'listing js-listing '})
    print(len(links))
    filename="NEWS.csv"
    f=open(filename,"w", encoding = 'utf-8')
    headers="Price\n"
    f.write(headers)
    
for j in links:
    Price = j.find("div", attrs={'class': 'price'})
    frame.append(Price)
    upperframe.extend(frame)
f.close()
data=pd.DataFrame(upperframe, columns=['Price'])
data.head()

What am I doing wrong? There are sites where this approach works, such as Myprotein, but on others it does not.


Solution

  • Only the price is extracted here, since that is all you asked for.

    Without a User-Agent header the request returns a 403 Forbidden error.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    start_url = "https://ingatlan.com/lista/elado+lakas+ii-ker?page=1"
    # Without the User-Agent header the response is 403 Forbidden
    page_data = requests.get(start_url, headers={'User-Agent': 'XYZ/3.0'})
    soup = BeautifulSoup(page_data.content, "html.parser")

    prices = []
    for content in soup.find_all("div", class_="resultspage__content"):
        for listing in content.find_all("div", class_="listing js-listing"):
            for container in listing.find_all("div", class_="price__container js-has-sqm-price-info-tooltip"):
                price = container.find("div", class_="price")
                prices.append(price.text.strip())

    data = pd.DataFrame(prices, columns=["price"])
    print(data)
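A note on the multi-class lookup above: when you pass `class_` a string containing spaces, BeautifulSoup matches it against the tag's full `class` attribute value as an exact string, which is why the question's `'listing js-listing '` (with a trailing space) found nothing. A CSS selector via `select` is less brittle; here is a minimal sketch on a stripped-down HTML snippet modelled on the page's structure:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the real page's markup (assumed structure)
html = """
<div class="resultspage__content">
  <div class="listing js-listing">
    <div class="price__container js-has-sqm-price-info-tooltip">
      <div class="price">31.5 M Ft</div>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# .listing.js-listing matches tags that carry BOTH classes, in any order
prices = [div.get_text(strip=True)
          for div in soup.select("div.listing.js-listing div.price")]
print(prices)  # ['31.5 M Ft']
```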
    

    Output of the pandas DataFrame:

             price
    0    31.5 M Ft
    1    77.9 M Ft
    2      62 M Ft
    3   129.5 M Ft
    4     125 M Ft
    5    95.9 M Ft
    6    46.9 M Ft
    7    45.9 M Ft
    8    59.9 M Ft
    9     109 M Ft
    10     48 M Ft
    11     87 M Ft
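If you need the prices as numbers rather than strings, the "M Ft" values (millions of Hungarian forints) can be converted with pandas string methods. A sketch, assuming every price follows the "<number> M Ft" pattern shown above:

```python
import pandas as pd

# Sample values copied from the scraped output
data = pd.DataFrame(["31.5 M Ft", "77.9 M Ft", "62 M Ft"], columns=["price"])

# Strip the " M Ft" suffix and convert to float (price in millions of HUF)
data["price_m_huf"] = (
    data["price"].str.replace(" M Ft", "", regex=False).astype(float)
)
print(data)
```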