Tags: python, web-scraping, beautifulsoup, html-parsing

Beautiful Soup multi-page scrape: missing price value on the next page


I'm using BeautifulSoup to scrape a list of car names and prices from a multi-page site. Each page contains 40 records, and the code works correctly when scraping a single page. When scraping multiple pages (in this case only two, to check that the code works properly), I found that there is always missing data at the start of the next page (in the 'price' column), which leaves the data misaligned from record 41 onward.

A note on the price column: a listing's price can appear as-is ('ads_price_highlight') or as a discounted price ('ads_price').
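
(Side note: BeautifulSoup's class_ argument accepts a list, so either price class can be matched in a single find call. A minimal, self-contained sketch with dummy markup:)

from bs4 import BeautifulSoup

html = '<li><div class="ads_price">RM 9 800</div></li>'
container = BeautifulSoup(html, 'html.parser').li
# class_ given a list matches a div carrying either class name
price_div = container.find('div', class_=['ads_price_highlight', 'ads_price'])
print(price_div.text if price_div else None)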

Below is the code I wrote to parse multiple pages for this case. I still have no idea why the price column ends up with missing data while the other column is correct.

from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np

from time import sleep
from random import randint

headers = {"Accept-Language": "en-US, en;q=0.5"}

car = []
price = []

pages = np.arange(1,3,1)

for page in pages:

    url = 'https://www.mudah.my/malaysia/cars-for-sale/perodua?o=' + str(page) + '&q=&so=1&th=1'
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.text, 'html.parser')
    car_list = soup.find_all('li', class_='listing_ads_params')

    sleep(randint(2, 10))

    for container in car_list:
        cars = container.find('div', {'class': 'top_params_col1'})
        if cars is not None:
            car.append(cars.find('h2', {'class': 'list_title'}).text)

        # price shown as-is
        prices2 = container.find('div', class_='ads_price_highlight')
        if prices2 is not None:
            price.append(prices2.text)

        # discounted price
        prices = container.find('div', class_='ads_price')
        if prices is not None:
            price.append(prices.text)

df = pd.DataFrame(data = list(zip(car, price)),
                    columns = ['car', 'price'])

df.to_csv(r'carprice.csv', index = False)

Solution

  • There are two things going on here:

    1.) The standard html.parser doesn't parse this page well; use lxml or html5lib instead (a quick check is sketched after this list).

    2.) The page inserts "dummy" ad listings with class="honey-pot" between the regular ads, so the script needs to skip them.
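
    A quick way to confirm the parser issue is to parse the same page with each parser and compare how many price nodes each one finds (a sketch; lxml and html5lib must be installed, and the counts are only illustrative):

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.mudah.my/malaysia/cars-for-sale/perodua?o=1&q=&so=1&th=1'
    html = requests.get(url, headers={"Accept-Language": "en-US, en;q=0.5"}).content

    # if html.parser mishandles the markup, its count differs from the others
    for parser in ('html.parser', 'lxml', 'html5lib'):
        soup = BeautifulSoup(html, parser)
        print(parser, len(soup.select('div[class^="ads_price"]')))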

    A full example handling both issues:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.mudah.my/malaysia/cars-for-sale/perodua?o={page}&q=&so=1&th=1'
    headers = {"Accept-Language": "en-US, en;q=0.5"}
    
    for page in range(1, 3):
        soup = BeautifulSoup(requests.get(url.format(page=page), headers=headers).content, 'lxml')
    
        for title, price in zip(soup.select('#list-view-ads .list_ads:not(.honey-pot) .list_title'),
                                soup.select('#list-view-ads .list_ads:not(.honey-pot) div[class^="ads_price"]')):
            print('{:<60}{}'.format(title.get_text(strip=True), price.get_text(strip=True)))
    

    Prints:

    Ladies Owner/SE B.Kit-2008 Perodua MYVI 1.3 EZ (A)          RM 15 800
    Perodua MYVI 1.3 EZ (A) LIMETED EDITION                     RM 16 800
    Perodua MYVI 1.3 SX FACELIFT (M)                            RM 10 990
    Perodua VIVA 1.0 (A) ONE OWNER ACC FREE                     RM 9 800
    Perodua KELISA 1.0 SE EZS (A) Jaga Baik                     RM 13 990
    Perodua MYVI 1.3 EZi (A) PASSO RACY~17" RIMS                RM 22 990
    Perodua MYVI 1.3 (A) EZi tru 2007                           RM 14 800
    23k KM SUPER CARKING 2010 Perodua MYVI 1.3 EZ (A)           RM 16 800
    Perodua MYVI 1.3(M) SX 1 owner Ori mielage                  RM 10 800
    Perodua MYVI H/AV 1.5L (A) R3Bat3 2XXX                      RM 50 600
    Perodua ARUZ X 1.5L (A) R3BaT3 2XXX                         RM 72 600
    Perodua AXIA GXTRA R3BAT3 1XXX                              RM 35 300
    
    ...and so on.
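
    To get back to the CSV the question was building, the same selectors can fill the two lists in parallel so the rows stay aligned (a sketch reusing the answer's selectors, with pandas as in the question):

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'https://www.mudah.my/malaysia/cars-for-sale/perodua?o={page}&q=&so=1&th=1'
    headers = {"Accept-Language": "en-US, en;q=0.5"}

    cars, prices = [], []
    for page in range(1, 3):
        soup = BeautifulSoup(requests.get(url.format(page=page), headers=headers).content, 'lxml')
        # honey-pot dummies are excluded, so titles and prices pair up 1:1
        for title, price in zip(soup.select('#list-view-ads .list_ads:not(.honey-pot) .list_title'),
                                soup.select('#list-view-ads .list_ads:not(.honey-pot) div[class^="ads_price"]')):
            cars.append(title.get_text(strip=True))
            prices.append(price.get_text(strip=True))

    df = pd.DataFrame({'car': cars, 'price': prices})
    df.to_csv('carprice.csv', index=False)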