I want to prepare a dataframe of universities, their abbreviations, and website links.
My code:
import pandas as pd
import requests

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Current output:
ValueError: No tables found
Expected output:
df =
| | university_full_name | uni_abb | uni_url |
| --- | --- | --- | --- |
| 0 | Albert Einstein College of Medicine | AECOM | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |
That's one funky page you have there...
First, there are indeed no tables in there. Second, some organizations don't have links, others have redirect links and still others use the same abbreviation for more than one organization.
So you need to bring in the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh

url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)

rows = []
# each entry is a bulleted list item under a section heading
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those with no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for more than one organization
    # (a single list item can hold several links)
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        # for those with redirects, drop the extra "mw-redirect" class attribute
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))

# and now, at last, to the dataframe
cols = ['abb', 'url', 'full name']
df = pd.DataFrame(rows, columns=cols)
df
Output:
     abb                                                url                            full name
0  AECOM  https://en.wikipedia.org/wiki/Albert_Einstein_...  Albert Einstein College of Medicine
1    AFA  https://en.wikipedia.org/wiki/United_States_Ai...      United States Air Force Academy
etc.
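Since, as noted above, some abbreviations point to more than one organization, those entries simply show up as repeated values in the abb column. A quick sketch to inspect them, using standard pandas:

# list abbreviations that map to more than one organization
df[df.duplicated('abb', keep=False)].sort_values('abb')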
Note: you can rearrange the order of columns in the dataframe, if you are so inclined.
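For example, to match the column order and names from your expected output (the uni_* names below are just the ones you used in the question):

# reorder and rename the columns to match the layout in the question
df = df[['full name', 'abb', 'url']]
df.columns = ['university_full_name', 'uni_abb', 'uni_url']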