I have written a crawler, so I now have a set of crawled URLs. I need to build an index using a vector space model, or at least a list of all the terms in each page's HTML.
Take this random webpage as an example: https://www.centralpark.com/things-to-do/central-park-zoo/polar-bears/
How do I parse all the terms in that webpage? I don't understand whether I should grab the text between particular tags or do something else, or which library I should use. I'm completely lost.
Here is what I need to do with that HTML:
You can use an HTML parser online, but in principle, you can use the text in the body of the HTML ... or between tags such as p /p, h2 /h2.
Any help parsing the above HTML is appreciated.
EDIT: I'm trying BeautifulSoup:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url='https://www.centralpark.com/things-to-do/central-park-zoo/polar-bears/'
# opening up connection
uClient = uReq(my_url)
page_html = uClient.read()
# close connection
uClient.close()
page_soup = soup(page_html, features="html.parser")
print(page_soup.p)
How do I take all text elements into a list?
Ex:
<p>This is p</p>
<p>This is another p</p>
<h1>This is h1</h1>
maybe some other text tags
to
List = ['This is p','This is another p','This is h1',...]
Good, you're making progress!
I recommend that you pip install requests and use that. You'll find it is a much more convenient API than urllib. (Also, simply soup would be the usual name for that variable.)
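As a rough sketch of what the switch to requests could look like: the helper name fetch_soup and the 10-second timeout are assumptions made for illustration, not something from the question.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object.

    fetch_soup is a hypothetical helper name; the timeout value is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, features="html.parser")
```

Calling soup = fetch_soup(my_url) would then replace the urlopen / read / close sequence from the question.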
How do I take all text elements into a list?
It's as easy as this:
print(list(page_soup.find_all('p')))
which explains why so many people are quite fond of BeautifulSoup.
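Note that find_all returns tag objects rather than plain strings. To get the list of strings asked for in the question, one option is to call get_text() on each tag. The sketch below uses a small inline HTML string as a stand-in for page_html, and the choice of tag names is an assumption you would adjust to your pages.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (hypothetical); in practice this would be page_html.
html = """
<p>This is p</p>
<p>This is <strong>another</strong> p</p>
<h1>This is h1</h1>
"""

soup = BeautifulSoup(html, features="html.parser")

# find_all accepts a list of tag names and returns matches in document order.
text_tags = ["p", "h1", "h2", "h3", "li"]

# get_text(" ", strip=True) joins the text pieces inside each tag with a
# space and strips surrounding whitespace from every piece.
texts = [tag.get_text(" ", strip=True) for tag in soup.find_all(text_tags)]
print(texts)  # ['This is p', 'This is another p', 'This is h1']
```

One caveat: if one of the listed tags is nested inside another (say a p inside an li), its text will appear twice, so you may want to narrow the tag list for a given site.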
This displays an excerpt from the page:
paragraphs = page_soup.find_all('p')
for p in paragraphs:
    print(str(p)[:40])
<p class="lead">There are no longer any
<p><strong>Polar Bear</strong> (Ursus Ma
<p><strong>Zoo collection includes:</str
<p><strong>Found in the wild:</strong> A
<p><strong>See Them at the Central Park
<p><strong>Description:</strong> The mal
<p><strong>Zoo Bear Habitat:</strong> Th
<p><strong>What do they eat:</strong> T
<p><strong>Life span:</strong> 25 to 30
<p><strong>Threats:</strong> Global warm
<p><strong>Fun Facts:</strong> A newborn
<p>Copyright © 2004 - 2018 Greensward Gr
It is important to note that p is not a string. It is an object that can be searched, just like the soup it came from. You might want to find <strong> spans within it.
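For instance, here is a minimal sketch of searching inside each paragraph for its <strong> element. The HTML is a made-up stand-in with the same shape as the page excerpt, not the real page content.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the excerpt above.
html = """
<p class="lead">Intro paragraph without a label.</p>
<p><strong>Diet:</strong> example text</p>
<p><strong>Habitat:</strong> example text</p>
"""
soup = BeautifulSoup(html, features="html.parser")

labels = []
for p in soup.find_all("p"):
    # Each p is a Tag, so it supports find / find_all like the soup does.
    strong = p.find("strong")  # None when the paragraph has no <strong>
    if strong is not None:
        labels.append(strong.get_text(strip=True))

print(labels)  # ['Diet:', 'Habitat:']
```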