Tags: python, beautifulsoup, data-extraction

Python BeautifulSoup scraping loop runs only once


I am trying to scrape data, but the loop doesn't work correctly: it runs just once. I want to scrape the names of all the goods and their prices.

The names of the goods are inside "td" elements, e.g. "Sendok Semen 7 Bulat", and the prices are inside "div" elements, e.g. "8.500".

Here is my code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.ralali.com/search/semen'
res = requests.get(url)
html = BeautifulSoup(res.content,"html.parser")
#divs = html.find_all('div', class_ = "col-md-12 col-xs-12") 
divs = html.findAll('div', class_ = "row d-block")
cnt = 0

for div in divs:
  cnt += 1
  #print(div, end="\n"*2)
  price = div.find('span', class_ = 'float-right')
  print(price.text.strip())
  print(cnt)

Any help will be appreciated. Thanks


Solution

  • What happens?

    Somehow the loop doesn't work correctly. It loops just once.

    The loop itself works correctly; the problem is the way you are selecting things. html.findAll('div', class_ = "row d-block") finds only one <div> that matches your criteria, so the loop body runs exactly once.

    How to fix?

    Make your selection more specific, because what you really want to iterate over are the <tr> elements in the table. Note: in new code, use find_all() instead of findAll(); find_all() is the newer syntax. I often use CSS selectors, and the following gets the correct selection, so just replace your html.findAll('div', class_ = "row d-block") with:

    html.select('.d-block tbody tr')
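
    To see the difference between the two selections without hitting the site, here is a minimal sketch on a small inline document. The markup in SAMPLE_HTML is hypothetical, loosely modelled on the structure described above (one wrapper <div> around a table), and is not copied from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): one wrapper <div> containing a table
# whose rows hold position, name, currency and price.
SAMPLE_HTML = """
<div class="row d-block">
  <table>
    <tbody>
      <tr><td>1</td><td>Sendok Semen 7 Bulat</td><td><div><span>Rp</span> 8.500</div></td></tr>
      <tr><td>2</td><td>SEMEN</td><td><div><span>Rp</span> 10.000</div></td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The original selection: matches the single wrapper <div>.
containers = soup.find_all("div", class_="row d-block")

# The CSS selector: matches every table row inside that wrapper.
rows = soup.select(".d-block tbody tr")

print(len(containers))  # number of wrapper divs, so a loop over them runs once
print(len(rows))        # number of table rows, one loop iteration per product
```

    With this sample, looping over containers runs once, while looping over rows visits each product.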
     
    

    Example

    This will give you a well-structured list of dicts:

    import requests
    from bs4 import BeautifulSoup
    url = 'https://www.ralali.com/search/semen'
    res = requests.get(url)
    html = BeautifulSoup(res.content,"html.parser")
    
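    data = []
    # Each <tr> yields four text fragments: position, name, currency and price.
    for row in html.select('.d-block tbody tr'):
        data.append(
            dict(
                zip(['pos','name','currency','price'], row.stripped_strings)
            )
        )
    # In a notebook this displays the list; in a script use print(data).
    data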
    

    Output

    [{'pos': '1',
      'name': 'Sendok Semen 7 Bulat',
      'currency': 'Rp',
      'price': '8.500'},
     {'pos': '2',
      'name': 'Sendok Semen 8 Bulat Gagang Kayu',
      'currency': 'Rp',
      'price': '10.000'},
     {'pos': '3', 'name': 'SEMEN', 'currency': 'Rp', 'price': '10.000'},
     {'pos': '4',
      'name': 'Sendok Semen 8 Gagang Kayu SWARDFISH',
      'currency': 'Rp',
      'price': '10.000'},...]
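
    The prices above are still strings. If you want numbers, a minimal sketch could look like the following. It assumes the dot in '8.500' is a thousands separator (as is usual for rupiah amounts); parse_rupiah is a hypothetical helper name, not part of any library:

```python
def parse_rupiah(text):
    """Convert a price string such as '8.500' to an integer.

    Assumption: the dot is a thousands separator, not a decimal point.
    """
    return int(text.strip().replace(".", ""))


# Sample rows in the same shape as the output above.
sample = [
    {"pos": "1", "name": "Sendok Semen 7 Bulat", "currency": "Rp", "price": "8.500"},
    {"pos": "3", "name": "SEMEN", "currency": "Rp", "price": "10.000"},
]

# Add a numeric field next to the original string.
for item in sample:
    item["price_int"] = parse_rupiah(item["price"])

print(sample)
```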
    

    But Be Aware

    This only gets you the "Top 10 - List Of Popular Semen Prices In Ralali" table, not all goods and prices on the page. That is something you should clarify in your question.

    Getting more data from all products

    Option#1

    Use the API that is provided by the website and iterate over the pages with the p parameter:

    import math
    import requests
    
    url = 'https://rarasearch.ralali.com/v2/search/item?q=semen'
    res = requests.get(url)
    
    data = []
    
    # 20 items per page is assumed; round up and add 1 so the last page is included
    last_page = math.ceil(res.json()['total_item'] / 20)
    for p in range(1, last_page + 1):
        url = f'https://rarasearch.ralali.com/v2/search/item?q=semen&p={p}'
        res = requests.get(url)
        data.extend(res.json()['items'])
    
    print(data)
    

    Output:

    [{'id': 114797,
      'name': 'TIGA RODA Semen NON semen putih',
      'image': 'assets/img/Libraries/114797_TIGA_RODA_Semen_NON_semen_putih_1_UrwztohXHo9u1yRY_1625473149.png',
      'alias': 'tiga-roda-semen-non-semen-putih-157561001',
      'vendor_id': 21156,
      'vendor_alias': 'prokonstruksi',
      'rating': '5.00',
      'vendor_status': 'A',
      'vendor_name': 'Pro Konstruksi',
      'vendor_location': 'Palembang',
      'price': '101500.00',
      'discount': 0,
      'discount_percentage': 0,
      'free_ongkir_lokal': 0,
      'free_ongkir_nusantara': 1,
      'is_stock_available': 1,
      'minimum_order': 1,
      'maximum_order': 999999999,
      'unit_type': 'unit',
      'ss_type': 0,
      'is_open': 'Y',
      'wholesale_price': []},
     {'id': 268711,
      'name': 'Sendok Semen Ukuran 6',
      'image': 'assets/img/Libraries/268711_Sendok-Semen-Ukuran-6_HCLcQq6TUh5IiEPZ_1553521818.jpeg',
      'alias': 'Sendok-Semen-Ukuran-6',
      'vendor_id': 305459,
      'vendor_alias': 'distributorbangunan',
      'rating': None,
      'vendor_status': 'A',
      'vendor_name': 'Distributor Bangunan',
      'vendor_location': 'Bandung',
      'price': '11000.00',
      'discount': 0,
      'discount_percentage': 0,
      'free_ongkir_lokal': 0,
      'free_ongkir_nusantara': 0,
      'is_stock_available': 1,
      'minimum_order': 1,
      'maximum_order': 999999999,
      'unit_type': 'Unit',
      'ss_type': 0,
      'is_open': 'Y',
      'wholesale_price': []},...]
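
    How many pages the loop requests depends on how the total is turned into a page count. The sketch below shows the arithmetic in isolation. It assumes 20 items per page (the value the loop divides by) and 1-based page numbers; page_numbers is a hypothetical helper:

```python
import math

PAGE_SIZE = 20  # assumption: taken from the division by 20 in the loop above


def page_numbers(total_items, page_size=PAGE_SIZE):
    """Return the 1-based page numbers needed to cover total_items."""
    last_page = math.ceil(total_items / page_size)
    return range(1, last_page + 1)


# With 45 items, three pages are needed (20 + 20 + 5).
print(list(page_numbers(45)))

# For comparison: using round() as the exclusive upper bound of range()
# stops early and skips the remaining pages.
print(list(range(1, round(45 / PAGE_SIZE))))
```

    Rounding up and adding 1 to the upper bound of range() makes sure a partially filled last page is still requested.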
    

    Option#2

    Use Selenium, scroll to the bottom of the page to load all products, pass driver.page_source to your soup and start selecting.
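
    A rough sketch of that idea follows. It assumes Chrome with a matching driver is available and that the page keeps growing while you scroll. scroll_to_bottom and fetch_rendered_html are hypothetical helper names, and the pause and round limits are arbitrary values you would need to tune:

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=30):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded products time to appear
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, assume the end was reached
        last_height = new_height
    return rounds


def fetch_rendered_html(url, pause=1.0):
    """Open url in Chrome, scroll to the bottom and return the rendered HTML."""
    # Imported here so scroll_to_bottom can be used without a browser installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_to_bottom(driver, pause=pause)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

    You would then build the soup with BeautifulSoup(fetch_rendered_html(url), "html.parser") and select from it as before. The selectors for the full product list are not covered here and need to be worked out from the rendered page.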