python python-3.x html-parsing

Parsing HTML to retrieve terms


I have written a crawler, so I now have a set of crawled URLs. I need to build an index using a vector space model, or at least a list of all terms in each page's HTML.

Take this arbitrary page as an example: https://www.centralpark.com/things-to-do/central-park-zoo/polar-bears/

How do I extract all terms from that page? I don't understand whether I should grab the text between particular tags or do something else, or which library I should use. I'm completely lost.

Here is what I need to do with that HTML:

You can use an online HTML parser, but in principle you can use the text in the body of the HTML, or the text between tags such as <p>...</p> and <h2>...</h2>.

Any help parsing the page above is appreciated.

EDIT: I'm trying BeautifulSoup:

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    my_url = 'https://www.centralpark.com/things-to-do/central-park-zoo/polar-bears/'
    # open the connection and download the page
    uClient = uReq(my_url)
    page_html = uClient.read()
    # close the connection
    uClient.close()
    # parse the HTML with Python's built-in parser
    page_soup = soup(page_html, features="html.parser")
    # print the first <p> element
    print(page_soup.p)

How do I collect all text elements into a list?

Ex:

<p>This is p</p>
<p>This is another p</p>
<h1>This is h1</h1>
maybe some other text tags

to

List = ['This is p','This is another p','This is h1',...]

Solution

  • Good, you're making progress!

    I recommend that you pip install requests and use that. You'll find it is a much more convenient API than urllib. (Also, simply soup would be the usual name for that variable.)
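
As a rough illustration of the requests suggestion, here is a minimal sketch. The helper names fetch_html and make_soup and the 10-second timeout are assumptions made for this example, not part of either library. The demonstration at the bottom parses a literal string so that it runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Parse HTML with Python's built-in parser, as in the question."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration on a literal string.
soup = make_soup("<html><body><p>This is p</p></body></html>")
print(soup.p)
```

For a real page, make_soup(fetch_html(my_url)) would replace the urlopen / read / close sequence from the question.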

    How do I collect all text elements into a list?

    It's as easy as this:

        print(list(page_soup.find_all('p')))
    

    which explains why so many people are quite fond of BeautifulSoup.
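
Note that find_all returns tag objects rather than plain strings. To get a list like the one in the question, one option is to call get_text() on each tag. The sketch below assumes a small sample document in place of the real page, and the TEXT_TAGS selection is an assumption that you would adjust to the tags you care about.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the downloaded page (illustrative only).
html = """
<p>This is p</p>
<p>This is <strong>another</strong> p</p>
<h1>This is h1</h1>
<script>var ignored = 1;</script>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: these are the tags whose text should be indexed.
TEXT_TAGS = ["p", "h1", "h2", "h3", "li"]

# find_all accepts a list of tag names and returns matches in document order.
# get_text(" ", strip=True) flattens nested markup such as <strong> into
# one whitespace-normalized string per tag.
texts = [tag.get_text(" ", strip=True) for tag in soup.find_all(TEXT_TAGS)]

# Drop tags that contained no text at all.
texts = [t for t in texts if t]

print(texts)
```

Because only the listed tags are collected, the content of script elements is left out. If the selected tags can be nested inside each other (for example a p inside an li), the same text would appear twice, so the tag list may need tuning for a given site.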

    This displays the first 40 characters of each paragraph on the page:

        paragraphs = page_soup.find_all('p')
        for p in paragraphs:
            print(str(p)[:40])
    
    <p class="lead">There are no longer any 
    <p><strong>Polar Bear</strong> (Ursus Ma
    <p><strong>Zoo collection includes:</str
    <p><strong>Found in the wild:</strong> A
    <p><strong>See Them at the Central Park 
    <p><strong>Description:</strong> The mal
    <p><strong>Zoo Bear Habitat:</strong> Th
    <p><strong>What do they eat:</strong>  T
    <p><strong>Life span:</strong> 25 to 30 
    <p><strong>Threats:</strong> Global warm
    <p><strong>Fun Facts:</strong> A newborn
    <p>Copyright © 2004 - 2018 Greensward Gr
    

    Note that p is not a string. It is a Tag object that can be searched, just like the soup it came from. For example, you might want to find the <strong> elements within it.
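
To make the point about p being a searchable object concrete, here is a small sketch. The markup is a made-up stand-in shaped like the paragraphs in the output above, not actual content from the page.

```python
from bs4 import BeautifulSoup

# Made-up paragraph shaped like the ones on the page (illustrative only).
html = "<p><strong>Label:</strong> some value</p>"

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# p is a bs4 Tag, not a str, so it supports the same search methods as soup.
kind = type(p).__name__
print(kind)

# Search inside the paragraph for its <strong> element.
label = p.find("strong").get_text()
print(label)

# get_text() on the paragraph returns all of its text with the markup removed.
full_text = p.get_text()
print(full_text)
```

Calling str(p) instead gives the markup itself, which is what the excerpt above prints.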