Tags: python, parsing, beautifulsoup, html-parsing

Issues with Python BeautifulSoup parsing


I am trying to parse an HTML page with BeautifulSoup. The task is to get the data underlined in red in the screenshot (the central block of each lot) for all the lots on this page. I already get the data from the left and right blocks (lot, auction name, country, etc.), but I cannot work out how to get the data from the central block. Here is what I have so far.

import requests
import re
from bs4 import BeautifulSoup as bs
import pandas as pd

URL_TEMPLATE = "https://www.artprice.com/artist/15079/wassily-kandinsky/lots/pasts?ipp=100"
FILE_NAME = "test"

def parse(url=URL_TEMPLATE):
    result_list = {'lot': [], 'name': [], 'date': [], 'type1': [], 'type2': [], 'width': [], 'height': [], 'estimate': [], 'hammerprice': [], 'auction_date': [], 'auction': [], 'country': []}
    # Fetch the URL passed in rather than always using the module-level default
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    lot_info = soup.find_all('p', class_='hidden-xs')
    date_info = soup.find_all('date')
    names_info = soup.find_all('a', class_='sln_lot_show')
    auction_info = soup.find_all('p', class_='visible-xs')
    # Raw string so that \d, \s and \w reach the regex engine unescaped
    auction_date_info = soup.find_all(string=re.compile(r'\d\d\s\w\w\w\s\d\d\d\d'))[1::2]
    type1_info = soup.find_all('div')
    for i in range(len(lot_info)):
        result_list['lot'].append(lot_info[i].text)
    for i in range(len(date_info)):
        result_list['date'].append(date_info[i].text)
    for i in range (len(names_info)):
        result_list['name'].append(names_info[i].text)
    # Reuse auction_info instead of re-running find_all on every iteration
    for i in range(0, len(auction_info), 2):
        result_list['auction'].append(auction_info[i].strong.string)
    for i in range(1, len(auction_info), 2):
        result_list['country'].append(auction_info[i].string)
    for i in range(len(auction_date_info)):
        result_list['auction_date'].append(auction_date_info[i])
    return result_list
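# Columns not scraped yet are still empty; wrapping each list in a Series pads them with NaN
df = pd.DataFrame({key: pd.Series(values) for key, values in parse().items()})
df.to_excel(f"{FILE_NAME}.xlsx")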

So, the task is to get the data from the central block separately for each lot on this page.


Solution

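  • Use the CSS :nth-of-type() pseudo-class with select_one() to reach each of those <p> elements by its position inside the lot's central <div>.

    The loop below stops after the first lot (note the break) to show that it works; remove the break to process every lot.
    I'll leave it to you to clean up the output.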

    for div in soup.find_all('div',class_='col-xs-8 col-sm-6'): 
        print(div.select_one('a').text.strip()) 
        print(div.select_one('p:nth-of-type(2)').text.strip()) 
        print(div.select_one('p:nth-of-type(3)').text.strip()) 
        print(div.select_one('p:nth-of-type(4)').text.strip()) 
        break 
    

    Result:

    Abstract
    Print-Multiple, Print in colors, 29 1/2 x 31 1/2 in75 x 80 cm
    Estimate:
    
                  € 560 - € 784
    
    
                  $ 605 - $ 848
    
    
                  £ 500 - £ 700
    
    
                  ¥ 4,303 - ¥ 6,025
    Hammer price:
                  not communicated
    not communicated
    not communicated
    not communicated
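
One possible way to clean up that output is sketched below. It uses only the standard library and works on the text returned by .text; the helper names clean and split_block are made up for this illustration and are not part of BeautifulSoup. It assumes each block keeps the layout shown in the result above: a label line ending in a colon, followed by one value per line.

```python
def clean(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


def split_block(text):
    """Split a block such as 'Estimate:\\n   € 560 - € 784\\n ...' into (label, values).

    Hypothetical helper: assumes the first non-blank line is the label
    and every later non-blank line is one value.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "", []
    return lines[0].rstrip(":"), lines[1:]


if __name__ == "__main__":
    # Sample text copied from the result above (shortened to two currencies)
    raw = "Estimate:\n\n              € 560 - € 784\n\n\n              $ 605 - $ 848\n"
    label, values = split_block(raw)
    print(label, values)
```

Inside the loop above, split_block(div.select_one('p:nth-of-type(3)').text) would then give the label and the list of price ranges for that lot, and clean() can be applied to the single-line fields such as the title.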