I have been attempting to create a web scraper for the following site using requests and BeautifulSoup to pull the information, after which I will be appending this to an Excel sheet using xlsxwriter:
The above link shows each query parameter from the previous page being set to random defaults.
Using requests I am able to deliver a payload with the same link as the above, which I plan to alter using user input once I can get past the '#'
symbol.
This is the code that I'm currently using:
from bs4 import BeautifulSoup
import requests
import xlsxwriter
# Argument variables
payload = {
'jcid': 'value1',
'kw': 'value2',
'classid': 'value3',
'depid': 'value4',
'locid': 'value5',
'postdays': 'value6',
'tenid': 'value7',
'timid': 'value8',
'minsal': 'value9',
'appmethid': 'value10',
'socmajorcode': 'value11'
}
# Request
r = requests.get(
'https://www.calcareers.ca.gov/CalHRPublic/Search/JobSearchResults.aspx#', params = payload)
print(r.url)
The response that I am getting from the print(r.url)
is:
https://www.calcareers.ca.gov/CalHRPublic/Search/JobSearchResults.aspx?jcid=value1&kw=value2&classid=value3&depid=value4&locid=value5&postdays=value6&tenid=value7&timid=value8&minsal=value9&appmethid=value10&socmajorcode=value11
The issue is that the website won't load with a '?'
and instead needs to be passed '#'
.
Any thoughts on how I could accomplish this with requests? It seems like this could be circumvented with selenium, but I wanted to give this a go because I've hit a brick wall with this one.
You can override the requests.get()
method for your use.
import requests
class ASPRequest(requests.Request):
def get(self, url, params, **kwargs):
qstring = '#'
for key, value in params.items():
qstring = qstring+"{}={}&".format(key, value)
return requests.get(url+qstring)
payload = {
'jcid': 'value1',
'kw': 'value2',
'classid': 'value3',
'depid': 'value4',
'locid': 'value5',
'postdays': 'value6',
'tenid': 'value7',
'timid': 'value8',
'minsal': 'value9',
'appmethid': 'value10',
'socmajorcode': 'value11'
}
r = ASPRequest().get(
url = 'https://www.calcareers.ca.gov/CalHRPublic/Search/JobSearchResults.aspx', params = payload)
print(r.url)