Tags: python, visual-studio, web-scraping, python-3.4, ptvs

Extracting links and titles only


I am trying to extract the links and their titles from an anime website. However, I am only able to extract the whole tag; I just want the href and the title.

Here's the code I am using:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
soup = BeautifulSoup(r.content, "html.parser")
for link in soup.find_all('div', class_='list_episode'):
    href = link.get('href')
    print(href)

And here's the website HTML:

<a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25" title="Phi Brain: Kami no Puzzle 3 episode 25">
                    Phi Brain: Kami no Puzzle 3 episode 25                  <span> 26-03-2014</span>
        </a>

And this is the output:

C:\Python34\python.exe C:/Users/M.Murad/PycharmProjects/untitled/Webcrawler.py
None

Process finished with exit code 0

All I want is every link and title in that class (the episodes and their links).

Thanks.


Solution

  • Your loop variable link is the <div> with class "list_episode", not an anchor. A <div> has no "href" attribute, so link.get('href') returns None. That <div> does, however, contain many <a> anchors, each holding the link in "href" and the title in "title".

    Just modify the code a little and you will have what you want.

    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.find_all('div', class_='list_episode'):
        # Collect (href, title) for every anchor inside this <div>
        href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]
        print(href_and_title)  # print() is a function in Python 3
    

    The output will be a list of tuples of the form [(href, title), (href, title), ..., (href, title)].
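
To check the shape of that output without requesting the live site, here is a minimal offline sketch. It assumes the page wraps its episode anchors in a <div class="list_episode">, as the question's code implies. The first anchor is copied from the question; the second one (episode 24 and its date) is invented purely for illustration, and extract_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the snippet in the question.
# The episode-24 anchor is made up for illustration; the real page may differ.
SAMPLE_HTML = """
<div class="list_episode">
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25" title="Phi Brain: Kami no Puzzle 3 episode 25">
    Phi Brain: Kami no Puzzle 3 episode 25 <span> 26-03-2014</span>
  </a>
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-24" title="Phi Brain: Kami no Puzzle 3 episode 24">
    Phi Brain: Kami no Puzzle 3 episode 24 <span> 19-03-2014</span>
  </a>
</div>
"""


def extract_links(html):
    """Return (href, title) tuples for every anchor inside div.list_episode."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for container in soup.find_all("div", class_="list_episode"):
        for a in container.find_all("a"):
            # .get() returns None instead of raising if the attribute is missing
            pairs.append((a.get("href"), a.get("title")))
    return pairs


links = extract_links(SAMPLE_HTML)
for href, title in links:
    print(title, "->", href)
```

Unpacking each tuple in the loop gives one episode per line, which may be easier to read than printing the whole list at once.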

    Edit (explanation):

    When you call

    soup.find_all('div', class_='list_episode')
    

    it returns every <div> with class "list_episode" in the page. Each of those <div> elements contains many anchors (<a>), each with its own "href" and "title". To get them, we loop over link.find_all("a") and read each attribute with .get():

    href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]
    

    I hope it's clearer this time.
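
As a small illustration of why the original code printed None, the sketch below calls .get("href") on the <div> itself and then on the anchors inside it. The markup is a made-up stand-in, not the real page. The CSS selector passed to select() is an alternative to the nested find_all calls, not the method used in the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the second anchor has no href on purpose.
html = (
    '<div class="list_episode">'
    '<a href="/episode-1" title="Episode 1">Episode 1</a>'
    '<a name="no-link">anchor without href</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# The <div> has a class attribute but no href, so .get("href") returns None.
div = soup.find("div", class_="list_episode")
div_href = div.get("href")
print(div_href)

# Alternative: one CSS selector that matches only anchors that carry an href
# and sit inside div.list_episode.
pairs = [(a["href"], a.get("title")) for a in soup.select("div.list_episode a[href]")]
print(pairs)
```

Filtering on a[href] skips anchors that have no link at all, so a["href"] cannot raise a KeyError here.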