python html parsing html-table html-parsing

Python parsing tables with BeautifulSoup


HTML-page structure:

<table>
    <tbody>
        <tr>
           <th>Timestamp</th>
           <th>Call</th>
           <th>MHz</th>
           <th>SNR</th>
           <th>Drift</th>
           <th>Grid</th>
           <th>Pwr</th>
           <th>Reporter</th>
           <th>RGrid</th>
           <th>km</th> 
           <th>az</th>
        </tr>
        <tr>
           <td align="right">&nbsp;2019-12-10 14:02&nbsp;</td>
           <td align="left">&nbsp;DL1DUZ&nbsp;</td>
           <td align="right">&nbsp;10.140271&nbsp;</td>
           <td align="right">&nbsp;-26&nbsp;</td>
           <td align="right">&nbsp;0&nbsp;</td>
           <td align="left">&nbsp;JO61tb&nbsp;</td>
           <td align="right">&nbsp;0.2&nbsp;</td>
           <td align="left">&nbsp;F4DWV&nbsp;</td>
           <td align="left">&nbsp;IN98bc&nbsp;</td>
           <td align="right">&nbsp;1162&nbsp;</td>
           <td align="right">&nbsp;260&nbsp;</td>
        </tr>
        <tr>
           <td align="right">&nbsp;2019-10-10 14:02&nbsp;</td>
           <td align="left">&nbsp;DL23UH&nbsp;</td>
           <td align="right">&nbsp;11.0021&nbsp;</td>
           <td align="right">&nbsp;-20&nbsp;</td>
           <td align="right">&nbsp;0&nbsp;</td>
           <td align="left">&nbsp;JO61tb&nbsp;</td>
           <td align="right">&nbsp;0.2&nbsp;</td>
           <td align="left">&nbsp;F4DWV&nbsp;</td>
           <td align="left">&nbsp;IN98bc&nbsp;</td>
           <td align="right">&nbsp;1162&nbsp;</td>
           <td align="right">&nbsp;260&nbsp;</td>
        </tr>
    </tbody>
</table>

The remaining rows repeat the same tr/td pattern. My code:

from bs4 import BeautifulSoup as bs
import requests
import csv

base_url = 'some_url'
session = requests.Session()
request = session.get(base_url)
val_th = []
val_td = []

if request.status_code == 200:
    soup = bs(request.content, 'html.parser')
    table = soup.findChildren('table')
    tr = soup.findChildren('tr')
    my_table = table[0]
    my_tr_th = tr[0]
    my_tr_td = tr[1]
    rows = my_table.findChildren('tr')
    row_th = my_tr_th.findChildren('th')
    row_td = my_tr_td.findChildren('td')
    for r_th in row_th:
       heading = r_th.text
       val_th.append(heading)
    for r_td in row_td:
        data = r_td.text
        val_td.append(data)
    with open('output.csv', 'w') as f:
        a_pen = csv.writer(f)
        a_pen.writerow(val_th)
        a_pen.writerow(val_td)

1) This writes only one row of td values. How do I make sure that every td row on the page ends up in the CSV? 2) The page contains many td rows. 3) Changing my_tr_td = tr[1] to my_tr_td = tr[1:50] does not work. How do I write the data from all tr/td rows to a CSV file?

Thanks in advance.


Solution

  • Let's try it this way:

    import lxml.html
    import csv
    import requests
    
    url = "http://wsprnet.org/drupal/wsprnet/spots"
    res = requests.get(url)
    
    doc = lxml.html.fromstring(res.text)
    
    # First, extract the column headers from the <th> cells of the header row.
    # '//table//tr' matches rows whether or not they are wrapped in a <tbody>.
    cols = [th.text_content().strip() for th in doc.xpath('//table//tr/th')]
    # Now the actual data: text_content() also copes with empty cells (where .text is None),
    # and strip() removes the non-breaking spaces (&nbsp;) that pad each value.
    inf = [td.text_content().strip() for td in doc.xpath('//table//tr//td')]
    
    # Split the flat list of cells into rows of len(cols) cells each.
    rows = [inf[x:x+len(cols)] for x in range(0, len(inf), len(cols))]
    
    # Finally, write the header row and all data rows to the file.
    with open("output.csv", "w", newline='') as f:
        writer = csv.writer(f)
        writer.writerow(cols)
        writer.writerows(rows)
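
Since the question uses BeautifulSoup, the same result can probably be reached without switching to lxml: loop over every tr in the table instead of picking tr[0] and tr[1] by index. The following is only a minimal sketch under some assumptions: `SAMPLE_HTML` is a shortened, hypothetical stand-in for the real page, `table_rows` is a helper name made up for this example, and the data is assumed to sit in the first table of the document.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, shortened sample in the shape of the table from the question.
SAMPLE_HTML = """
<table><tbody>
  <tr><th>Timestamp</th><th>Call</th><th>MHz</th></tr>
  <tr><td>&nbsp;2019-12-10 14:02&nbsp;</td><td>&nbsp;DL1DUZ&nbsp;</td><td>&nbsp;10.140271&nbsp;</td></tr>
  <tr><td>&nbsp;2019-10-10 14:02&nbsp;</td><td>&nbsp;DL23UH&nbsp;</td><td>&nbsp;11.0021&nbsp;</td></tr>
</tbody></table>
"""


def table_rows(html):
    """Return every row of the first <table> as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the data is in the first table
    rows = []
    for tr in table.find_all("tr"):
        # The header row holds <th> cells, the data rows hold <td> cells.
        cells = tr.find_all(["th", "td"])
        # get_text(strip=True) also removes the &nbsp; padding around each value.
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = table_rows(SAMPLE_HTML)

# Written to an in-memory buffer here so the sketch has no side effects;
# for a real file use: open("output.csv", "w", newline="")
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

For the live page, the text of the response from requests (for example request.text in the question's code) can be passed to `table_rows` in place of `SAMPLE_HTML`. Because each tr becomes one CSV row, no slicing such as tr[1:50] is needed.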