python, csv, web-scraping, beautifulsoup

How to fix scraped web table output to CSV with Python and bs4


Help me please. I want to extract two fields from the "td" cells, "Barcode" and "Nama Produk", but I get very bad data. What should I fix?

import csv
import requests
from bs4 import BeautifulSoup


outfile = open("dataaa.csv","w",newline='')
writer = csv.writer(outfile)


page = 0
while page < 3 :
    url = "http://ciumi.com/cspos/barcode-ritel.php?page={:d}".format(page)
    response = requests.get(url)
    tree = BeautifulSoup(response.text, 'html.parser')
    page += 1
    table_tag = tree.select("table")[0]
    tab_data = [[item.text for item in row_data.select("tr")]
    for row_data in table_tag.select("td")]
    for data in tab_data:
        writer.writerow(data)
        print(table_tag)
        print(response, url, ' '.join(data))


import fileinput
seen = set() 
for line in fileinput.FileInput('dataaa.csv', inplace=1):
    if line in seen: continue

    seen.add(line)
    print (line)

What do I need to change to get clean results?


Solution

  • It looks like pages start from 1, so my range loop starts there. You can use a Session object to re-use the underlying connection across requests, which is more efficient. If you choose your CSS selectors carefully, all filtering can be done at that level, so you only work with the elements you actually need. You can use the lightweight csv module rather than the heavier pandas import.

    Requires bs4 4.7.1+ because it uses the :has pseudo-class selector.


    Quick explanation:

    The following selects the first-column barcodes by targeting only center elements with the type selector center:

    soup.select('center')
    

    Then

    soup.select('td:has(center) + td')
    

    selects the second column. The adjacent sibling combinator (+) matches the table cell (td) immediately to the right of a td that contains a center element.
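
To see what the two selectors match, here is a minimal sketch run against a small inline HTML fragment. The markup and the sample values are assumptions made up for illustration (the real page's markup may differ), and it uses the built-in html.parser so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the barcode sits in a <center> inside the first cell,
# and the product name is plain text in the next cell.
html = """
<table>
  <tr><th>Barcode</th><th>Nama Produk</th></tr>
  <tr><td><center>1111111111111</center></td><td> Sample Product A </td></tr>
  <tr><td><center>2222222222222</center></td><td>Sample Product B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Type selector: every <center> element (first column).
barcodes = [tag.text.strip() for tag in soup.select("center")]

# Adjacent sibling combinator: the <td> right after a <td> containing <center>.
names = [tag.text.strip() for tag in soup.select("td:has(center) + td")]

print(barcodes)
print(names)
```

The header row is skipped automatically because its cells contain no center element, so no extra filtering is needed.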

    The .text of each retrieved tag is extracted and stripped inside list comprehensions. The two lists are then zipped into (barcode, name) pairs and appended to the results list, which is later looped over to write the CSV.
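
The pairing and writing steps can be tried in isolation. This is a minimal sketch with hypothetical values in place of scraped data, writing to an in-memory buffer instead of a file so that the output is easy to inspect.

```python
import csv
import io

# Hypothetical values standing in for the two scraped columns.
barcodes = ["1111111111111", "2222222222222"]
names = ["Sample Product A", "Sample Product B, 500 ml"]

# zip pairs the i-th barcode with the i-th name; it stops at the shorter list.
rows = list(zip(barcodes, names))

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["Barcode", "Nama Produk"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Because zip silently truncates to the shorter input, it may be worth comparing the lengths of the two lists if the selectors could ever return different counts. QUOTE_MINIMAL only quotes fields that need it, such as a product name containing a comma.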

    The CSS selectors are kept minimal to allow faster matching.


    import requests, csv
    from bs4 import BeautifulSoup as bs
    
    results = []
    
    with requests.Session() as s:
        for page in range(1,4):   #pages start at 1 and assuming you actually want first 3
            r = s.get(f'http://ciumi.com/cspos/barcode-ritel.php?page={page}')
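            soup = bs(r.content, 'lxml')   # the lxml parser must be installed separately
            barcodes = [i.text.strip() for i in soup.select('center')]             # first column
            names = [i.text.strip() for i in soup.select('td:has(center) + td')]   # second column
            results += list(zip(barcodes, names))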
    
    with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
        w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
        w.writerow(['Barcode','Nama Produk'])
        w.writerows(results)   # write all (barcode, name) rows at once
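
The question's script removed duplicate lines by rewriting the CSV with fileinput. A lighter alternative, sketched here under the assumption that duplicates are exact repeats of a (barcode, name) tuple, is to deduplicate the results in memory before writing them. The sample rows are hypothetical.

```python
# Hypothetical scraped rows; one product appears twice.
results = [
    ("1111111111111", "Sample Product A"),
    ("2222222222222", "Sample Product B"),
    ("1111111111111", "Sample Product A"),
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping the first occurrence of each row.
unique_results = list(dict.fromkeys(results))
print(unique_results)
```

Passing unique_results to the CSV writer in place of results would then avoid the separate clean-up pass over the file.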
    

    Additional reading:

    1. css selectors