python html extract

Identify Webpage


Hi, I am trying to parse a webpage in Python. The page is in a restricted area, so I cannot give the link. On this page you can run queries, and the results are published in a table that is added to the same page, but with a new URL. When I parse the page I get everything except the table.

I have noticed that no matter what my query is, the URL is always the same. So I always get the same result from my parser: the webpage without the query result (the table). But if I inspect the webpage in Chrome, the table and its results are included in the HTML. My parser just looks like this:

import urllib.request
with urllib.request.urlopen("http://www.home_page.com") as url:
    s = url.read()
#I'm guessing this would output the html source code?
print(s)

So my question: is there some other way to identify the webpage so that I receive everything that is published on it?


Solution

  • Well, based on your question, I think you are looking for web scraping techniques.

    Here is what I suggest: you could use regular expressions to extract data that follows a specific pattern.
    For example:

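    import re
    import urllib.request

    # Fetch the page and decode the bytes to text (assumes UTF-8 encoding).
    siteContent = urllib.request.urlopen("http://example.com").read().decode("utf-8")
    # Capture the run of word characters and spaces after each opening <b> tag.
    getBoldWords = re.findall(r"<b>([\w ]+)", siteContent)
    print("Bold words are:")
    print(getBoldWords)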
    

    In this case you will have to learn more about regex (regular expressions) and write your own pattern.

    In some specific cases you might have to deal with the client side (for example, you have to submit queries through pop-up pages created by JavaScript, or you have to dismiss a JavaScript alert). Then you need a web browser API; you could use Selenium to deal with this kind of issue.
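
As a rough illustration of writing your own pattern for the table from the question, here is a minimal sketch. The sample HTML, the `results` id, and the names `ROW_PATTERN`, `CELL_PATTERN`, and `extract_table_rows` are all made up for this example, since the real page's markup is unknown. It assumes you already have the full HTML as a string, for example from `urlopen(...).read().decode()` or, if the table is only added by JavaScript, from Selenium's `driver.page_source`.

```python
import re

# Hypothetical sample standing in for the HTML of the results page.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

# One pattern per level: first split the markup into rows, then each row into cells.
ROW_PATTERN = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_PATTERN = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.DOTALL | re.IGNORECASE)


def extract_table_rows(html):
    """Return a list of rows, each a list of the cell texts found in that row."""
    rows = []
    for row_html in ROW_PATTERN.findall(html):
        cells = [cell.strip() for cell in CELL_PATTERN.findall(row_html)]
        if cells:
            rows.append(cells)
    return rows


rows = extract_table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Patterns like these are fragile on nested tables or unusual markup, so treat this only as a starting point for a pattern that fits your own page.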