Search code examples
pythonpandasdataframeurlpython-requests-html

Python University Names and Abbrevations and Weblink


I want to prepare a dataframe of universities, its abbrevations and website link.

My code:

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)

Present answer:

ValueError: No tables found

Expected answer:

df =

|      |  university_full_name              |  uni_abb  |  uni_url|
---------------------------------------------------------------------
|    0 |  Albert Einstein College of Medicine | AECOM   |  https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine|

Solution

  • That's one funky page you have there...

    First, there are indeed no tables in there. Second, some organizations don't have links, others have redirect links and still others use the same abbreviation for more than one organization.

    So you need to bring in the heavy artillery: xpath...

    import pandas as pd
    import requests
    from lxml import html as lh
    
    url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
    response = requests.get(url)
    
    doc = lh.fromstring(response.text)
    rows = []
    for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):    
        info = uni.text.split(' – ')
        abb = info[0]
        
        #for those w/ no links
        if not uni.xpath('.//a'):
            rows.append((abb," ",info[1]))
    
        #now to account for those using the same abbreviation for multiple teams
        for a in uni.xpath('.//a'):
            dat = a.xpath('./@*')
            
            #for those with redirects
            if len(dat)==3:
                del dat[1]
            link = f"https://en.wikipedia.org{dat[0]}"
            rows.append((abb,link,dat[1]))
       
    #and now, at last, to the dataframe
    cols = ['abb','url','full name']
    df = pd.DataFrame(rows,columns=cols)
    df
    

    Output:

        abb     url                                               full name
    0   AECOM   https://en.wikipedia.org/wiki/Albert_Einstein_...   Albert Einstein College of Medicine
    1   AFA     https://en.wikipedia.org/wiki/United_States_Ai...   United States Air Force Academy       
    

    etc.

    Note: you can rearrange the order of columns in the dataframe, if you are so inclined.