Search code examples
python-3.xpandasparsingdataframedata-processing

Read a NASDAQ HTML table to a Dataframe


I get the most recent list of traded companies from NASDAQ with this code however I'd like to have the results shown in a data-frame instead of just the list with all the other info I might not need.

Any ideas how could that be achieved? Thanks

Parsing latest NASDAQ company

    from bs4 import BeautifulSoup
    import requests

    r=requests.get('https://www.nasdaq.com/screening/companies-by 
    industry.aspx 
    exchange=NASDAQ&sortname=marketcap&sorttype=1&pagesize=4000')
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find( "table", {"id":"CompanylistResults"} )
    for row in table.findAll("tr"):
        for cell in row("td"):
            print (cell.get_text().strip())

Solution

  • Looks like you are looking for the aptly named read_html, though you need to play around until you get what you want. In your case:

    >>> import pandas as pd
    >>> df=pd.read_html(table.prettify(),flavor='bs4')[0]
    >>> df.columns = [c.strip() for c in df.columns]
    

    See output below.

    The first line is what gets the job done, and the second one just strips off all those pesky spaces and new lines in your header. Looks like there is a hidden ADR TSO which seems useless, so you could drop it if you do not know what it is. It may also make sense to drop all even rows as they are just a continuation of the odd rows, and useless links as far as I can tell. In a single line:

    >>> df = df.drop(['ADR TSO'], axis=1) #Drop useless column
    >>> df1= df[::2] #To get rid of even rows
    >>> df2= df[~df['Name'].str.contains('Stock Quote')].head() #By string filtration if we are not sure about the odd/even thing
    

    Output of original head just for show:

    >>> df.head()
                                                    Name Symbol Market Cap  \
    0                                   Amazon.com, Inc.   AMZN   $802.18B
    1  AMZN Stock Quote  AMZN Ratings  AMZN Stock Report    NaN        NaN
    2                              Microsoft Corporation   MSFT   $789.12B
    3  MSFT Stock Quote  MSFT Ratings  MSFT Stock Report    NaN        NaN
    4                                      Alphabet Inc.  GOOGL    $740.3B
    
       ADR TSO        Country IPO Year  \
    0      NaN  United States     1997
    1      NaN            NaN      NaN
    2      NaN  United States     1986
    3      NaN            NaN      NaN
    4      NaN  United States      n/a
    
                                             Subsector
    0                   Catalog/Specialty Distribution
    1                                              NaN
    2          Computer Software: Prepackaged Software
    3                                              NaN
    4  Computer Software: Programming, Data Processing
    

    Output of cleaned df.head():

                        Name Symbol Market Cap        Country IPO Year  \
    0       Amazon.com, Inc.   AMZN   $802.18B  United States     1997
    2  Microsoft Corporation   MSFT   $789.12B  United States     1986
    4          Alphabet Inc.  GOOGL    $740.3B  United States      n/a
    6          Alphabet Inc.   GOOG   $735.24B  United States     2004
    8             Apple Inc.   AAPL    $720.3B  United States     1980
    
                                             Subsector
    0                   Catalog/Specialty Distribution
    2          Computer Software: Prepackaged Software
    4  Computer Software: Programming, Data Processing
    6  Computer Software: Programming, Data Processing
    8                           Computer Manufacturing