python visual-studio web-scraping python-3.4 ptvs

Extracting links and titles only

I am trying to extract the links and their titles from an anime website. However, I am only able to extract the whole tag; I just want the href and the title.

Here's the code I am using:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
soup = BeautifulSoup(r.content, "html.parser")
for link in soup.find_all('div', class_='list_episode'):
    href = link.get('href')
    print(href)

And here's the website HTML:

<a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25" title="Phi Brain: Kami no Puzzle 3 episode 25">
                    Phi Brain: Kami no Puzzle 3 episode 25                  <span> 26-03-2014</span>
        </a>

And this is the output:

C:\Python34\python.exe C:/Users/M.Murad/PycharmProjects/untitled/Webcrawler.py
None

Process finished with exit code 0

All I want is every link and title inside that class (the episodes and their links).

Thanks.

Solution

What is happening is that your link variable holds the <div> with class "list_episode", not an anchor. A <div> has no "href" attribute, so link.get('href') returns None. That <div> contains many anchors (<a>), and each one holds the link in "href" and the title in "title".

Modify the code a little to loop over those anchors and you will have what you want.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
soup = BeautifulSoup(r.content, "html.parser")
for link in soup.find_all('div', class_='list_episode'):
    href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]
    print(href_and_title)

The output will be a list of the form [(href, title), (href, title), ..., (href, title)].
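To try this without hitting the site, here is a minimal offline sketch. It assumes the anchors sit inside a <div class="list_episode">, as the code above implies. SAMPLE_HTML and extract_links are hypothetical names, and the episode 24 entry is invented for illustration; only the episode 25 anchor comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page. The episode 25 anchor is taken
# from the question; the episode 24 anchor is invented for illustration.
SAMPLE_HTML = """
<div class="list_episode">
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25" title="Phi Brain: Kami no Puzzle 3 episode 25">
    Phi Brain: Kami no Puzzle 3 episode 25 <span> 26-03-2014</span>
  </a>
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-24" title="Phi Brain: Kami no Puzzle 3 episode 24">
    Phi Brain: Kami no Puzzle 3 episode 24 <span> 19-03-2014</span>
  </a>
</div>
"""


def extract_links(html):
    """Return (href, title) pairs for every anchor inside div.list_episode."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_="list_episode"):
        # The div itself has no href, which is why the original code printed None.
        assert div.get("href") is None
        for a in div.find_all("a"):
            pairs.append((a.get("href"), a.get("title")))
    return pairs


pairs = extract_links(SAMPLE_HTML)
for href, title in pairs:
    print(title, "->", href)
```

Returning a flat list of pairs, rather than printing one list per <div>, makes the result easier to loop over or write to a file later.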

Edit (explanation):

When you call

soup.find_all('div', class_='list_episode')

it returns every <div> on the page with class "list_episode". Each of those <div> elements contains many anchors with different "href" and "title" values, so to get them we loop over link.find_all("a") (there can be multiple <a> anchors) and call .get() on each one:

href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]

I hope it's clearer this time.
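As an alternative to the nested find_all calls, BeautifulSoup's select() accepts a CSS selector that reaches the anchors in one step. This is only a sketch under the same assumption about the page layout: SAMPLE_PAGE and extract_with_selector are hypothetical names, and the markup is a made-up stand-in rather than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor from the question, one anchor without an
# href, and one anchor outside the list_episode div.
SAMPLE_PAGE = """
<div class="list_episode">
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25" title="Phi Brain: Kami no Puzzle 3 episode 25">
    Phi Brain: Kami no Puzzle 3 episode 25 <span> 26-03-2014</span>
  </a>
  <a title="No link here">placeholder</a>
</div>
<div class="footer">
  <a href="http://animeonline.vip/" title="Home">Home</a>
</div>
"""


def extract_with_selector(html):
    """Return (href, title) pairs using a single CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    # "a[href]" matches only anchors that actually carry an href attribute,
    # and "div.list_episode" restricts the search to the episode list.
    anchors = soup.select("div.list_episode a[href]")
    # a.get("title", "") falls back to an empty string if title is missing.
    return [(a["href"], a.get("title", "")) for a in anchors]


selected = extract_with_selector(SAMPLE_PAGE)
print(selected)
```

The selector skips anchors that have no href and ignores links elsewhere on the page, so no None values end up in the result.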