Help me please, I want to extract two pieces of data from the td cells, "Barcode" and "nama produk" (product name), but I get very messy data. What should I fix?
import csv
import requests
from bs4 import BeautifulSoup

outfile = open("dataaa.csv", "w", newline='')
writer = csv.writer(outfile)
page = 0
while page < 3:
    url = "http://ciumi.com/cspos/barcode-ritel.php?page={:d}".format(page)
    response = requests.get(url)
    tree = BeautifulSoup(response.text, 'html.parser')
    page += 1
    table_tag = tree.select("table")[0]
    tab_data = [[item.text for item in row_data.select("tr")]
                for row_data in table_tag.select("td")]
    for data in tab_data:
        writer.writerow(data)
        print(table_tag)
        print(response, url, ' '.join(data))
import fileinput

seen = set()
for line in fileinput.FileInput('dataaa.csv', inplace=1):
    if line in seen:
        continue
    seen.add(line)
    print(line, end='')  # line already ends with a newline
What do I need to improve to get clean results?
It looks like the pages start from 1, so my range loop starts there. Then you can use a Session object for efficiency, re-using the underlying connection. If you choose your CSS selectors wisely, all the filtering can be done at that level, and you then only work with the required elements. You can also use the more lightweight csv module rather than a heavier pandas import.
Requires bs4 4.7.1+, as it leverages the :has pseudo-class selector.
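If you are not sure which version you have, a quick check of the installed bs4:

import bs4

print(bs4.__version__)  # must be 4.7.1 or newer for :has support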
Quick explanation:
The following selects the first-column barcodes by targeting only the center elements, using the type selector center:

soup.select('center')
Then

soup.select('td:has(center) + td')

selects the second column by using the adjacent sibling combinator to get the table cell (td) immediately to the right of the cell that has a center child element. Both selectors are demonstrated on a toy table below.
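As a quick illustration, here are both selectors applied to a made-up table shaped like the target page (the HTML and values below are invented for the demo, not taken from the real site):

from bs4 import BeautifulSoup  # requires bs4 4.7.1+ for :has

html = """
<table>
  <tr><td><center> 111111 </center></td><td> Produk A </td></tr>
  <tr><td><center> 222222 </center></td><td> Produk B </td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

print([i.text.strip() for i in soup.select('center')])
# ['111111', '222222']
print([i.text.strip() for i in soup.select('td:has(center) + td')])
# ['Produk A', 'Produk B']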
The retrieved tag lists have their .text extracted and stripped within list comprehensions; the two lists are then zipped, converted back to a list, and appended to the final results list, which is later looped over to write the csv.
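In other words, each page contributes a list of (barcode, name) tuples. A minimal sketch of just that zipping step, with placeholder lists standing in for the two selector results:

barcodes = ['111111', '222222']   # stand-ins for the stripped center texts
names = ['Produk A', 'Produk B']  # stand-ins for the adjacent td texts

results = []
results += list(zip(barcodes, names))
print(results)  # [('111111', 'Produk A'), ('222222', 'Produk B')]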
The css selectors are kept minimal to allow for faster matching.
import requests, csv
from bs4 import BeautifulSoup as bs

results = []

with requests.Session() as s:
    for page in range(1, 4):  # pages start at 1, and assuming you actually want the first 3
        r = s.get(f'http://ciumi.com/cspos/barcode-ritel.php?page={page}')
        soup = bs(r.content, 'lxml')
        results += list(zip([i.text.strip() for i in soup.select('center')],
                            [i.text.strip() for i in soup.select('td:has(center) + td')]))

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Barcode', 'Nama Produk'])
    for line in results:
        w.writerow(line)