python pandas dataframe import html-parsing

Extracting data from an HTML table in <p> rather than <table>

I had been using pd.read_html to try to extract data from a URL, but the data is listed in <p> tags rather than <table>. I am probably missing a simple lesson here, but I am not sure what function to use to get a good result (a table) rather than the long string I was getting. Any suggestions would be appreciated! I tried both of these and got the same result:

import requests
import pandas as pd
url = 'http://www.linfo.org/acronym_list.html'
dfs = pd.read_html(url, header=0)
df = pd.concat(dfs)
df

import pandas as pd
url ='http://www.linfo.org/acronym_list.html'
data = pd.read_html(url, header=0)
data[0]

Out[1]:

ABCDEFGHIJKLMNOPQRSTUVWXYZ A AMD Advanced Micro Devices API application programming interface ARP address resolution protocol ARPANET Advanced Research Projects Agency Network AS autonomous system ASCII American Standard Code for Information Interchange AT&T American Telephone and Telegraph Company ATA advanced technology attachment ATM asynchronous transfer mode B B byte BELUG Bellevue Linux Users Group BGP border gateway protocol...

Solution

I'm using BeautifulSoup to parse the HTML returned by the request, walking through each <p> and <br> tag. The final result is a DataFrame, which you can later export to an Excel file. I hope this helps.
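To illustrate the idea without hitting the network, here is a minimal sketch that runs on a small inline snippet. The snippet's layout (one <p> per letter, a <strong> heading, then one entry after each <br>) is only my assumption about how the page is built, and the `paragraphs_to_frame` helper and the acronym/meaning split on the first space are hypothetical additions for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical snippet imitating the ASSUMED page layout: one <p> per letter,
# a <strong> heading, then one entry after each <br>.
SAMPLE_HTML = """
<p><strong>A</strong><br>AMD Advanced Micro Devices<br>API application programming interface</p>
<p><strong>B</strong><br>B byte<br>BGP border gateway protocol</p>
"""

def paragraphs_to_frame(html):
    """Turn <p>/<br>-delimited entries into a DataFrame (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.find_all("p"):
        heading = p.find("strong")
        if heading is None:
            continue  # skip paragraphs that carry no letter heading
        letter = heading.get_text(strip=True)
        for br in p.find_all("br"):
            entry = br.next_sibling  # the text node right after the <br>
            if isinstance(entry, str) and entry.strip():
                # Assumption: the acronym is the first space-delimited token.
                acronym, _, meaning = entry.strip().partition(" ")
                rows.append({"letter": letter, "acronym": acronym, "meaning": meaning})
    return pd.DataFrame(rows, columns=["letter", "acronym", "meaning"])

df = paragraphs_to_frame(SAMPLE_HTML)
print(df)
```

If the real page is laid out differently, the selectors would need adjusting. The full script below applies the same approach to the live page.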

from bs4 import BeautifulSoup
import requests
import pandas as pd

result = requests.get('http://www.linfo.org/acronym_list.html')
c = result.content
soup = BeautifulSoup(c, "html.parser")
samples = soup.find_all("p")

rows_list = []

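for row in samples:
    # Each <p> is expected to hold a <strong> letter heading and <br>-separated entries.
    tagstrong = row.find_all("strong")
    for x in tagstrong:
        tagbr = row.find_all("br")
        for y in tagbr:
            # next_element is the node right after the <br>, i.e. the entry text.
            new_row = {'letter': x.get_text(), 'content': y.next_element}
            rows_list.append(new_row)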

df1 = pd.DataFrame(rows_list)
print(df1.head(10))

This is the result: