Web Scraping a Directory w/ BeautifulSoup Outside of an Open Classifier


I am trying to scrape names out of a directory using BeautifulSoup, but the way the HTML is structured makes this difficult. Here is an example of one entry in the directory:

    <li><span class="image-wrapper-outer"><span class="image-wrapper-inner"><img src="/directory/images/1234.jpg" alt="student photo"/></span></span><strong>Name:</strong> Alex Example<br/>
    <strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br/>
    <strong>Year:</strong> 2017<br/>
    <strong>Box #:</strong> 123<br/>
    <strong>Local phone:</strong> 1234<br/>
    <strong>Home Info:</strong> 7033 Fake St.<br/>Chicago NY 90210 <br/>
    <strong>Advisors:</strong> Advisor1, Advisor2<br/><br/></li>

I'm not very experienced with HTML, and I cannot find a dedicated element, something like <name>John Doe</name>, that wraps the information I am trying to scrape. The name is just bare text that follows the <strong>Name:</strong> label.

Here is my existing code:

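    import requests
    from bs4 import BeautifulSoup

    def makeSoup(url):
        # Download the page and parse it into a BeautifulSoup tree.
        r = requests.get(url)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        return soup

    # url_list (the directory page URLs) is defined elsewhere.
    for i in range(0, 1):
        souptemp = makeSoup(url_list[i])
        for link in souptemp.find_all('need help here'):
            print(link)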

Thank you for the help today.


Solution

  • You could remove the strong tags and retrieve the name by splitting the text by lines:

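    soup = BeautifulSoup(data, 'html.parser')

    # Remove every <strong> label so that only the values remain in the text.
    for s in soup.find_all('strong'):
        s.extract()
    # For the entry shown in the question, the name is the first line of what is left.
    print(soup.text.split('\n')[0].strip())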
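
Another option, offered only as a sketch, is to leave the tree intact and read the text node that directly follows the Name: label, using BeautifulSoup's next_sibling attribute. The SAMPLE_HTML string and the extract_names helper below are hypothetical names made up for illustration; the markup is a shortened copy of the entry from the question, and the code assumes Python 3 with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Shortened copy of the directory entry from the question (illustrative only).
SAMPLE_HTML = """
<ul>
<li><span class="image-wrapper-outer"><span class="image-wrapper-inner"><img src="/directory/images/1234.jpg" alt="student photo"/></span></span><strong>Name:</strong> Alex Example<br/>
<strong>Year:</strong> 2017<br/>
<strong>Box #:</strong> 123<br/></li>
</ul>
"""


def extract_names(html):
    """Return the text that follows each <strong>Name:</strong> label."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for label in soup.find_all("strong"):
        # Skip the other labels (Email:, Year:, and so on).
        if label.get_text(strip=True) != "Name:":
            continue
        # The value is the bare text node right after the label.
        value = label.next_sibling
        if isinstance(value, str):
            names.append(value.strip())
    return names


names = extract_names(SAMPLE_HTML)
print(names)
```

Because this keys on the label text rather than on line position, it should keep working when a page holds many entries, as long as each name really does sit immediately after its Name: label.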