I have been using pd.read_html to try to extract data from a URL, but the data is listed in <p> tags rather than <table>. I am probably missing a simple lesson here, but I am not sure what function to use to get a good result (a table) rather than the long string I was getting. Any suggestions would be appreciated! I tried both of these and got the same result:
import requests
import pandas as pd

url = 'http://www.linfo.org/acronym_list.html'
dfs = pd.read_html(url, header=0)
df = pd.concat(dfs)
df

import pandas as pd

url = 'http://www.linfo.org/acronym_list.html'
data = pd.read_html(url, header=0)
data[0]
Out[1]:
ABCDEFGHIJKLMNOPQRSTUVWXYZ A AMD Advanced Micro Devices API application programming interface ARP address resolution protocol ARPANET Advanced Research Projects Agency Network AS autonomous system ASCII American Standard Code for Information Interchange AT&T American Telephone and Telegraph Company ATA advanced technology attachment ATM asynchronous transfer mode B B byte BELUG Bellevue Linux Users Group BGP border gateway protocol...
I'm using BeautifulSoup to parse the HTML returned by the request, walking through each p and br tag. The final result is a DataFrame, which you can later export to an Excel file. I hope that helps.
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Download the page and parse it with the built-in HTML parser
result = requests.get('http://www.linfo.org/acronym_list.html')
c = result.content
soup = BeautifulSoup(c, "html.parser")

# Each <p> block holds one letter heading plus its entries
samples = soup.find_all("p")
rows_list = []
for row in samples:
    tagstrong = row.find_all("strong")
    for x in tagstrong:
        # print(x.get_text())
        tagbr = row.find_all("br")
        for y in tagbr:
            # The node right after each <br> is the entry text
            new_row = {'letter': x.get_text(), 'content': y.next_element}
            rows_list.append(new_row)

# Build the DataFrame once all rows have been collected
df1 = pd.DataFrame(rows_list)
print(df1.head(10))
This is the result:
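
As a possible variation on the code above, here is a minimal sketch that also splits each entry into an acronym and its meaning, so the DataFrame ends up with separate columns. It runs on a small inline sample instead of the live page. SAMPLE_HTML, parse_acronyms, and the column names are hypothetical, and the markup is only an assumption about how the page is laid out (a strong letter heading followed by br-separated entries), so adjust it after inspecting the real HTML.

```python
from bs4 import BeautifulSoup, NavigableString
import pandas as pd

# Hypothetical sample that mimics the ASSUMED layout of the page:
# one <p> per letter, a <strong> heading, then entries separated by <br>.
SAMPLE_HTML = """
<p><strong>A</strong>
<br>AMD Advanced Micro Devices
<br>API application programming interface
</p>
<p><strong>B</strong>
<br>B byte
<br>BGP border gateway protocol
</p>
"""


def parse_acronyms(html):
    """Return a DataFrame with one row per acronym entry found in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for paragraph in soup.find_all("p"):
        heading = paragraph.find("strong")
        if heading is None:
            continue  # paragraph without a letter heading, skip it
        letter = heading.get_text(strip=True)
        for br in paragraph.find_all("br"):
            entry = br.next_sibling
            # Only keep plain text that directly follows the <br>
            if not isinstance(entry, NavigableString):
                continue
            text = entry.strip()
            if not text:
                continue
            # Assumption: the acronym is the first whitespace-free token
            acronym, _, meaning = text.partition(" ")
            rows.append({"letter": letter, "acronym": acronym, "meaning": meaning})
    return pd.DataFrame(rows, columns=["letter", "acronym", "meaning"])


df = parse_acronyms(SAMPLE_HTML)
print(df)
```

To use it on the real page, pass requests.get(url).content to parse_acronyms instead of the sample. Splitting on the first space will mislabel any acronym that itself contains a space, so treat that step as a starting point. For the Excel export mentioned above, df.to_excel("acronyms.xlsx", index=False) should work, given an Excel writer such as openpyxl is installed; the file name is only an example.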