Hi, I am trying to parse a webpage in Python. The page is in a restricted area, so I cannot give the link. On this page you can run queries, and the results are published in a table that is added to the same page, but with a new URL. When I parse the page I get everything except the table.
I have noticed that no matter what my query is, the URL is always the same. So I always get the same result from my parser: the webpage without the query result (the table). But if I inspect the webpage (in Chrome), the table and its results are included in the HTML. My parser just looks like this:
import urllib.request
with urllib.request.urlopen("http://www.home_page.com") as url:
    s = url.read()
#I'm guessing this would output the html source code?
print(s)
So my question is: is there some other way to identify the webpage so that I receive everything that is published on it?
Well, based on your question I think you are looking for web scraping techniques.
Here is what I suggest:
you can use regular expressions to get data that follows a specific pattern.
For example:
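import re
import urllib.request

# Fetch the page and decode the bytes so the regex can work on text
site_content = urllib.request.urlopen("http://example.com").read().decode("utf-8", errors="replace")
# Match each opening <b> tag together with the words that follow it
bold_words = re.findall(r"<b>[\w ]+", site_content)
print("Bold words are:")
print(bold_words)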
So in this case you will have to learn more about regex (regular expressions)
and write your own pattern for the data you want.
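To make the idea of writing your own pattern a bit more concrete, here is a minimal sketch that pulls rows and cells out of a table with regex. The sample_html string and the extract_rows helper are made up for illustration, since the real page is not available, and regex on HTML is fragile (nested tables, for instance, would break it), so treat it as a starting point only.

```python
import re

# Hypothetical sample standing in for the HTML of a results page.
sample_html = """
<table>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

# Non-greedy match of each row, then of each cell inside a row.
row_pattern = re.compile(r"<tr[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
cell_pattern = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL | re.IGNORECASE)


def extract_rows(html):
    """Return a list of rows, each a list of stripped cell strings."""
    return [
        [cell.strip() for cell in cell_pattern.findall(row)]
        for row in row_pattern.findall(html)
    ]


rows = extract_rows(sample_html)
print(rows)
```

This only helps once the table is actually present in the HTML you downloaded, which is not the case in the question as posted.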
In some specific cases you might have to deal with the client side, for example when you have to submit queries through pop-up pages created by JavaScript, or dismiss a JavaScript alert.
Then you have to use a web browser API. You could use Selenium to deal with this kind of issue.
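A rough sketch of the Selenium route could look like the following. It assumes Selenium is installed, that Chrome and a matching driver are available, and that the query result appears as a table element; the function and class names are my own, and the login for the restricted area and the step that submits the query are left out because they depend on the page. The table parsing part uses only the standard library.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, timeout=10):
    """Load url in a real browser and return the HTML once a <table> exists."""
    # Imported here so the parsing part below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Assumption: the query result shows up as a <table> element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows
```

With these in place, something like parse_table(fetch_rendered_html("http://www.home_page.com")) should give you the rows as lists of strings, provided the query has been submitted in that browser session first.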