python pandas dataframe import html-parsing

Extracting data from an HTML table in <p> tags rather than <table>


I had been using pd.read_html to try to extract data from a URL, but the data is listed in <p> tags rather than in a <table>. I am probably missing a simple lesson here, but I am not sure what function to use to get a good result (a table) rather than the long string I was getting. Any suggestions would be appreciated! I tried both of the following and got the same result:

import requests
import pandas as pd
url = 'http://www.linfo.org/acronym_list.html'
dfs = pd.read_html(url, header=0)
df = pd.concat(dfs)
df

import pandas as pd
url ='http://www.linfo.org/acronym_list.html'
data = pd.read_html(url, header=0)
data[0]

Out[1]:

ABCDEFGHIJKLMNOPQRSTUVWXYZ A AMD Advanced Micro Devices API application programming interface ARP address resolution protocol ARPANET Advanced Research Projects Agency Network AS autonomous system ASCII American Standard Code for Information Interchange AT&T American Telephone and Telegraph Company ATA advanced technology attachment ATM asynchronous transfer mode B B byte BELUG Bellevue Linux Users Group BGP border gateway protocol...


Solution

  • I use BeautifulSoup to parse the HTML returned by the request, reading each <p> and <br> tag. The final result is a DataFrame, which you can later export to an Excel file. I hope this helps.

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd

    # Download the page and parse it with Python's built-in HTML parser
    result = requests.get('http://www.linfo.org/acronym_list.html')
    soup = BeautifulSoup(result.content, "html.parser")
    samples = soup.find_all("p")

    rows_list = []

    for row in samples:
        # <strong> tags in the paragraph are used as the letter heading
        tagstrong = row.find_all("strong")
        # the node right after each <br> is taken as one entry
        tagbr = row.find_all("br")
        for x in tagstrong:
            for y in tagbr:
                # next_element replaces the older .next alias
                new_row = {'letter': x.get_text(), 'content': y.next_element}
                rows_list.append(new_row)

    df1 = pd.DataFrame(rows_list)
    print(df1.head(10))
    

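If you want the acronym and its meaning in separate columns, here is a minimal sketch of the same <p>/<br> idea that runs without a network connection. Everything in it is an assumption for illustration: SAMPLE_HTML is a made-up fragment modelled on the layout the code above expects (a <strong> letter followed by <br>-separated entries), parse_acronyms is a hypothetical helper name, and splitting each entry on its first space only works if acronyms never contain spaces. The real page may need adjustments.

```python
from bs4 import BeautifulSoup, NavigableString
import pandas as pd

# Hypothetical fragment modelled on the assumed page layout (not the real page)
SAMPLE_HTML = """
<p><strong>A</strong><br>
AMD Advanced Micro Devices<br>
API application programming interface</p>
<p><strong>B</strong><br>
B byte<br>
BGP border gateway protocol</p>
"""


def parse_acronyms(html):
    """Return a DataFrame with one row per <br>-separated entry."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.find_all("p"):
        heading = p.find("strong")
        if heading is None:
            continue  # skip paragraphs without a letter heading
        letter = heading.get_text(strip=True)
        for br in p.find_all("br"):
            nxt = br.next_sibling
            # only plain text directly after a <br> counts as an entry
            if not isinstance(nxt, NavigableString):
                continue
            text = nxt.strip()
            if not text:
                continue
            # assumption: the acronym is everything before the first space
            acronym, _, meaning = text.partition(" ")
            rows.append({"letter": letter, "acronym": acronym, "meaning": meaning})
    return pd.DataFrame(rows, columns=["letter", "acronym", "meaning"])


df = parse_acronyms(SAMPLE_HTML)
print(df)
```

To try it on the real page, you could pass requests.get(url).content to parse_acronyms instead of SAMPLE_HTML. For the Excel export, df.to_excel('acronyms.xlsx', index=False) should work, assuming an Excel engine such as openpyxl is installed (the file name is just an example).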