I am trying to scrape data. Somehow the loop doesn't work correctly. It loops just once. I want to scrape all the name of the goods and the price.
The goods are inside "td" eg : "Sendok Semen 7 Bulat" and the price are inside "div" eg : "8.500"
Here is my code :
import requests
from bs4 import BeautifulSoup
url = 'https://www.ralali.com/search/semen'
res = requests.get(url)
html = BeautifulSoup(res.content,"html.parser")
#divs = html.find_all('div', class_ = "col-md-12 col-xs-12")
divs = html.findAll('div', class_ = "row d-block")
cnt = 0
for div in divs:
cnt += 1
#print(div, end="\n"*2)
price = div.find('span', class_ = 'float-right')
print(price.text.strip())
print(cnt)
Any help will be appreciated. Thanks
Somehow the loop doesn't work correctly. It loops just once.
It is not the loop that won't work correctly, it is rather the way you are selecting things. So html.findAll('div', class_ = "row d-block")
will find only one <div>
that matches your criteria.
Make you are selecting more specific, cause what you are really want to iterate are the <tr>
in the table - I often use css selectors
and the following will get the correct selection, so just replace your html.findAll('div', class_ = "row d-block")
Note In new code use find_all()
instead of findAll()
it is the newer syntax:
html.select('.d-block tbody tr')
Will give you a well structured list of dicts:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ralali.com/search/semen'
res = requests.get(url)
html = BeautifulSoup(res.content,"html.parser")
data = []
for row in html.select('.d-block tbody tr'):
data.append(
dict(
zip(['pos','name','currency','price'],list(row.stripped_strings))
)
)
data
[{'pos': '1',
'name': 'Sendok Semen 7 Bulat',
'currency': 'Rp',
'price': '8.500'},
{'pos': '2',
'name': 'Sendok Semen 8 Bulat Gagang Kayu',
'currency': 'Rp',
'price': '10.000'},
{'pos': '3', 'name': 'SEMEN', 'currency': 'Rp', 'price': '10.000'},
{'pos': '4',
'name': 'Sendok Semen 8 Gagang Kayu SWARDFISH',
'currency': 'Rp',
'price': '10.000'},...]
It will just help you to get the Top 10 - List Of Popular Semen Prices In Ralali and not all goods and prices on the page --> That is something you should clarify in your question.
Use an api that is provided by the website and iterate by parameter pages:
import requests
url = 'https://rarasearch.ralali.com/v2/search/item?q=semen'
res = requests.get(url)
data = []
for p in range(1, round(res.json()['total_item']/20)):
url = f'https://rarasearch.ralali.com/v2/search/item?q=semen&p={p}'
res = requests.get(url)
data.extend(res.json()['items'])
print(data)
Output:
[{'id': 114797,
'name': 'TIGA RODA Semen NON semen putih',
'image': 'assets/img/Libraries/114797_TIGA_RODA_Semen_NON_semen_putih_1_UrwztohXHo9u1yRY_1625473149.png',
'alias': 'tiga-roda-semen-non-semen-putih-157561001',
'vendor_id': 21156,
'vendor_alias': 'prokonstruksi',
'rating': '5.00',
'vendor_status': 'A',
'vendor_name': 'Pro Konstruksi',
'vendor_location': 'Palembang',
'price': '101500.00',
'discount': 0,
'discount_percentage': 0,
'free_ongkir_lokal': 0,
'free_ongkir_nusantara': 1,
'is_stock_available': 1,
'minimum_order': 1,
'maximum_order': 999999999,
'unit_type': 'unit',
'ss_type': 0,
'is_open': 'Y',
'wholesale_price': []},
{'id': 268711,
'name': 'Sendok Semen Ukuran 6',
'image': 'assets/img/Libraries/268711_Sendok-Semen-Ukuran-6_HCLcQq6TUh5IiEPZ_1553521818.jpeg',
'alias': 'Sendok-Semen-Ukuran-6',
'vendor_id': 305459,
'vendor_alias': 'distributorbangunan',
'rating': None,
'vendor_status': 'A',
'vendor_name': 'Distributor Bangunan',
'vendor_location': 'Bandung',
'price': '11000.00',
'discount': 0,
'discount_percentage': 0,
'free_ongkir_lokal': 0,
'free_ongkir_nusantara': 0,
'is_stock_available': 1,
'minimum_order': 1,
'maximum_order': 999999999,
'unit_type': 'Unit',
'ss_type': 0,
'is_open': 'Y',
'wholesale_price': []},...]
Use selenium
, scroll to the bottom of the page toa load all products, push the driver.page_source
to your soup
and start selecting, ...