python pandas dataframe url python-requests-html

Python University Names and Abbrevations and Weblink

I want to prepare a dataframe of universities, its abbrevations and website link.

My code:

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)

Present answer:

ValueError: No tables found

Expected answer:

df =

|      |  university_full_name              |  uni_abb  |  uni_url|
---------------------------------------------------------------------
|    0 |  Albert Einstein College of Medicine | AECOM   |  https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine|

Solution

That's one funky page you have there...

First, there are indeed no tables in there. Second, some organizations don't have links, others have redirect links and still others use the same abbreviation for more than one organization.

So you need to bring in the heavy artillery: xpath...

import pandas as pd
import requests
from lxml import html as lh

url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)

doc = lh.fromstring(response.text)
rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):    
    info = uni.text.split(' – ')
    abb = info[0]
    
    #for those w/ no links
    if not uni.xpath('.//a'):
        rows.append((abb," ",info[1]))

    #now to account for those using the same abbreviation for multiple teams
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        
        #for those with redirects
        if len(dat)==3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb,link,dat[1]))
   
#and now, at last, to the dataframe
cols = ['abb','url','full name']
df = pd.DataFrame(rows,columns=cols)
df

Output:

    abb     url                                               full name
0   AECOM   https://en.wikipedia.org/wiki/Albert_Einstein_...   Albert Einstein College of Medicine
1   AFA     https://en.wikipedia.org/wiki/United_States_Ai...   United States Air Force Academy

etc.

Note: you can rearrange the order of columns in the dataframe, if you are so inclined.