I am new to Python web scraping, REST APIs, and HTML. I know there are many existing questions similar to mine, but mine concerns an intranet website and none of the others match it. I have gone through almost every link over several days, and after all my failed attempts I am posting this question because I could not find any help. Please consider my efforts and do not mark it as a duplicate or unwanted question.
I am trying to automate toggling some column IDs on an internal website. For this, I am using Python web scraping to get the list of IDs in a particular column and then set each one to on or off. For example, if an ID matches an ID from an Excel file on my local machine, I should toggle the status column (in the same row as that ID) in the intranet portal to on or off. I am using the requests library for this. The intranet website only works after authenticating with a specific username and password.
The problem is that I am unable to log in to the web portal and then navigate to the page I need. All I get as output is part of the HTML that "View source" shows. Even when I request the target page directly (with the username and password as the payload), I still get only the home page data. Can anyone suggest how to scrape the data from the page I want after login? I am not sure whether the login succeeds, because all I get after the login request is an HTTP status code of 200. I understand that this only means the request itself succeeded. I cannot see any data as it appears after login; the scraped data is the home page as it looks before login.
Output scraped data:
<!doctype html>
<html lang="en" ng-app="lm.login.application" class="lm-scroll-bar html-overflow" ng-strict-di>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta charset="utf-8">
<meta name="HandheldFriendly" content="True">
<meta name="viewport"
content='width=device-width,height=device-height, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, target-densitydpi=device-dpi'/>
<link rel="icon" href="../favicon.ico?ui-version=12.0.40.12" type="image/x-icon">
<title>Login</title>
<link rel="stylesheet" href="/ui/generated/webpack/authpoint.beaf402df60c88783fc6.min.css?ui-version=12.0.40.12"/>
<script>
var lmSession = {
buildVersion: '76',
redirectTarget: 'https\x3A\x2F\x2F<intanet_webportal_>\x2Dprod.<intanet_address_>group.net\x2Fui\x2F',
language: 'english',
userLanguageCode: 'en',
isMLU: false,
isProduction: true,
isExternalAuthModeEnabled: false,
productBrandEditionDisplayName: 'EDITION PLACEHOLDER',
logLevel: 'error',
siteParams: {"LOGIN_PAGE_NAME_LABEL": ""},
loginNotice: '\x3Cdiv\x20style\x3D\x22font\x2Dsize\x3A120\x25\x3Bcolor\x3Ared\x3B\x22\x3EZur\x20erstmaligen\x20Nutzung\x20seit\x20dem\x20Update\x20Strg\x20\x2B\x20F5\x20dr\xFCcken\x20um\x20den\x20Seiten\x20Cache\x20zu\x20l\xF6schen.\x3C\x2Fdiv\x3E\x3Cbr\x3EWelcome\x20using\x20\x3Ca\x20href\x3D\x22http\x3A\x2F\x2F<intanet_webportal_>.<intanet_address_>group.net\x2F\x22\x20style\x3D\x22background\x2Dcolor\x3A\x23ffffa0\x22\x3ETAEE\x20Next\x3C\x2Fa\x3E\x20via\x20<intanet_webportal_>.\x3Cbr\x3E\x3Ca\x20href\x3D\x22https\x3A\x2F\x2Fvts4.<intanet_address_>group.net\x2Fsites\x2Ftundaee\x2F<intanet_webportal_>\x2FDocuments\x2FTAEE\x2DNext\x2520\x2D\x2520Disclaimer.pdf\x3FWeb\x3D1\x22\x20style\x3D\x22background\x2Dcolor\x3A\x23ffffa0\x22\x3EErkl\xE4rung\x20zum\x20Datenschutz\x2FPrivacy\x20notice\x3C\x2Fa\x3E\x20\x3Cbr\x3E\x3Ca\x20href\x3D\x22https\x3A\x2F\x2Fvts4.<intanet_address_>group.net\x2Fsites\x2Ftundaee\x2F<intanet_webportal_>\x2FDocuments\x2FNUTZUNGSBEDINGUNGEN\x2520TAEE\x2DNext.pdf\x22\x20style\x3D\x22background\x2Dcolor\x3A\x23ffffa0\x22\x3ENutzungsbedingungen\x3C\x2Fa\x3E'
};
</script>
</head>
<body ng-controller="lm.login.application.controller">
<noscript>
<div class="browser-misconfig-alert">LM requires that JavaScript be enabled in your browser</div>
</noscript>
<script src="/ui/generated/webpack/authpoint.17231e2531a66bfe2e17.min.js"></script>
<div class="ng-cloak" class="web-ui-login-main-wrapper">
<div class="web-ui-login-wrapper">
<ng-include src="'login-app.html?ui-version=12.0.40.12'"></ng-include>
</div>
</div>
</body>
</html>
Process finished with exit code 0
This is all I am able to scrape in spite of all my attempts. I cannot log in, navigate to the next page after login, or get the field I want.
Methods tried:
With all these methods, I get only the HTML shown above. My website has no CSRF token; it only has an XSRF header.
Could someone kindly explain where I am going wrong and how I can log in, navigate, and then get the data with Python scraping? I am bound to use only Python due to internal constraints. I understand that a 200 status code does not mean the login with the given user ID and password succeeded.
Any help would be highly appreciated. Many thanks! This would be a lifesaver.
As this is an intranet web portal, I changed the names so as not to disclose any data. I hope you understand.
Your Selenium approach seems correct to me. Here is a slightly adjusted version of your code. Please check the element selectors. The main idea is to wait for each element you need using WebDriverWait and scroll to it before performing any actions. For buttons it may be useful to use EC.element_to_be_clickable instead of EC.presence_of_element_located.
After retrieving a container element, you can use print(element.get_attribute('innerHTML')) for debugging.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
# delay for selenium web driver wait
DELAY = 30
# create selenium driver
chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')
#chrome_options.add_argument('--no-sandbox')
# Selenium 4 expects the driver path in a Service object (Selenium 3 took it as the first positional argument)
driver = webdriver.Chrome(service=Service('<<PATH_TO_CHROMEDRIVER>>'), options=chrome_options)
# open web page
driver.get('<<URL>>')
# maximize window
driver.maximize_window()
# wait for username input, scroll to it, enter username
username = WebDriverWait(driver, DELAY).until(EC.presence_of_element_located((By.ID, "inputusername")))
driver.execute_script("arguments[0].scrollIntoView();", username)
username.send_keys("user")
# wait for password input, scroll to it, enter password
password = WebDriverWait(driver, DELAY).until(EC.presence_of_element_located((By.ID, "password")))
driver.execute_script("arguments[0].scrollIntoView();", password)
password.send_keys("password")
# wait for submit button to be clickable, scroll to it, click it
submit = WebDriverWait(driver, DELAY).until(EC.element_to_be_clickable((By.ID, "login")))
driver.execute_script("arguments[0].scrollIntoView();", submit)
submit.click()
# quit driver
#driver.quit()
If you run into any problems, it would help to add the HTML source of the login page (using the element.get_attribute('innerHTML') approach described above).
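Since a 200 status code alone does not show whether the login worked, one option is to inspect the HTML that comes back. Below is a minimal sketch that checks for the two markers visible in the scraped output from the question (the `<title>Login</title>` tag and the `ng-app="lm.login.application"` attribute). The helper name `looks_like_login_page` is hypothetical, and the check assumes the post-login page does not use the same title or Angular app name, which I cannot verify for your portal.

```python
from html.parser import HTMLParser


class _LoginMarkerParser(HTMLParser):
    """Collects the <title> text and the ng-app attribute of the <html> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.ng_app = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "html":
            # attrs is a list of (name, value) pairs; value is None for bare attributes
            self.ng_app = dict(attrs).get("ng-app")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_login_page(html):
    """Return True if the HTML still carries the login page markers.

    Assumption: the markers are taken from the output shown in the
    question; the logged-in page is assumed to differ in both.
    """
    parser = _LoginMarkerParser()
    parser.feed(html)
    has_login_title = parser.title.strip().lower() == "login"
    has_login_app = (parser.ng_app or "").startswith("lm.login.")
    return has_login_title or has_login_app
```

You could call it with `driver.page_source` after clicking the submit button (or with `response.text` from requests). If it still returns True after the wait, the login most likely did not go through.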