Search code examples
pythonhtmlpython-3.xpandas-datareader

Issues Extracting data from HTML table and printing to CSV using Python


I am trying to scrape Stock Ticker List from IB Website and I am having an issue with extracting table information from HTML.

If I use,

import requests
website_url = requests.get('https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK#productbuffer').text
soup = BeautifulSoup(website_url,'lxml')

My_table = soup.find('div',{'class':'table-responsive no-margin'})
print (My_table)

It captures the HTML Data Information, but when I try to use it with the below code, it doesnt like it, so , as a workaround I captured the HTML Table Data Information and parse it manually.

I have the below code :

import pandas as pd
from bs4 import BeautifulSoup

html_string = """
  <div class="table-responsive no-margin">
      <table width="100%" cellpadding="0" cellspacing="0" border="0" 
 class="table table-striped table-bordered">
     <thead>
          <tr>
           <th width="15%" align="left" valign="middle" 
class="table_subheader">IB Symbol</th>
           <th width="55%" align="left" valign="middle" class="table_subheader">Product Description 
                    <span class="text-small">(click link for more details)</span></th>

       <th width="15%" align="left" valign="middle" class="table_subheader">Symbol</th>
       <th width="15%" align="left" valign="middle" class="table_subheader">Currency</th>
      </tr>
     </thead>
     <tbody>
                <tr>
                  <td>0JN9N</td>
                             <td><a href="javascript:NewWindow('https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&amp;site=GEN&amp;conid=189723078','Details','600','600','custom','front');" class="linkexternal">DSV AS</a></td>
                  <td>0JN9N</td>
       <td>MXN</td>
      </tr>
                <tr>
                  <td>0QBON</td>
                             <td><a href="javascript:NewWindow('https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&amp;site=GEN&amp;conid=189723075','Details','600','600','custom','front');" class="linkexternal">COLOPLAST-B</a></td>
                  <td>0QBON</td>
       <td>MXN</td>
      </tr>
                <tr>
                  <td>0R87N</td>
                             <td><a href="javascript:NewWindow('https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&amp;site=GEN&amp;conid=195567802','Details','600','600','custom','front');" class="linkexternal">ASSA ABLOY AB-B</a></td>
                  <td>0R87N</td>
       <td>MXN</td>
      </tr>
          </tbody>
      </table>"""

soup = BeautifulSoup(html_string, 'lxml') # Parse the HTML as a string
table = soup.find_all('table')[0] # Grab the first table
new_table = pd.DataFrame(columns=range(0,4), index = [0]) # I know the size
row_marker = 0

for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1

print(new_table)

But it is only showing the last line :

enter image description here

If I remove the last part , and add the following :

soup = BeautifulSoup(html, 'lxml')
table = soup.find("div")

# The first tr contains the field names.
headings = [th.get_text().strip() for th in 
table.find("tr").find_all("th")]

print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    df = pd.DataFrame(headings, (td.get_text() for td in 
    row.find_all("td")))
    datasets.append(df)

print(datasets)

df.to_csv('Path_to_file\\test1.csv')

It sees the rest of the items, but it is completely out of format and in the csv, it is only printing the last item of the list.

enter image description here

How can I extract the details of the HTML table directly from the website and print it to csv in the format of the first image ?


Solution

  • You can delete row_marker = 0

    for row_marker, row in enumerate(table.find_all('tr')):
        column_marker = 0
        columns = row.find_all('td')
        try:
            new_table.loc[row_marker] = [column.get_text() for column in columns]
        except ValueError:
            # It's a safe way when [column.get_text() for column in columns] is empty list.
            continue