I want to scrape some LinkedIn company pages with cURL and PHP. The LinkedIn API is not built for that, so I have to do it with PHP. If there are any other options, please let me know.
Before scraping a company page I have to sign in to LinkedIn with a personal account via cURL, but it doesn't seem to work.
I get a 'No CSRF token found in headers' error.
Could someone help me out?
Thanks!
<?php
require_once 'dom/simple_html_dom.php';
$linkedin_login_page = "https://www.linkedin.com/uas/login";
$username = 'linkedin_username';
$password = 'linkedin_password';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $linkedin_login_page);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
$login_content = str_get_html(curl_exec($ch));
if (curl_error($ch)) {
    echo 'error:' . curl_error($ch);
}
if ($login_content) {
    if (($login_content->find('input[name=isJsEnabled]', 0))) {
        foreach ($login_content->find('input[name=isJsEnabled]') as $element) {
            $isJsEnabled = trim($element->value);
            if ($isJsEnabled === "false") {
                $isJsEnabled = "true";
            }
        }
    }
    if (($login_content->find('input[name=source_app]', 0))) {
        foreach ($login_content->find('input[name=source_app]') as $element) {
            $source_app = trim($element->value);
        }
    }
    if (($login_content->find('input[name=tryCount]', 0))) {
        foreach ($login_content->find('input[name=tryCount]') as $element) {
            $tryCount = trim($element->value);
        }
    }
    if (($login_content->find('input[name=clickedSuggestion]', 0))) {
        foreach ($login_content->find('input[name=clickedSuggestion]') as $element) {
            $clickedSuggestion = trim($element->value);
        }
    }
    if (($login_content->find('input[name=session_redirect]', 0))) {
        foreach ($login_content->find('input[name=session_redirect]') as $element) {
            $session_redirect = trim($element->value);
        }
    }
    if (($login_content->find('input[name=trk]', 0))) {
        foreach ($login_content->find('input[name=trk]') as $element) {
            $trk = trim($element->value);
        }
    }
    if (($login_content->find('input[name=loginCsrfParam]', 0))) {
        foreach ($login_content->find('input[name=loginCsrfParam]') as $element) {
            $loginCsrfParam = trim($element->value);
        }
    }
    if (($login_content->find('input[name=fromEmail]', 0))) {
        foreach ($login_content->find('input[name=fromEmail]') as $element) {
            $fromEmail = trim($element->value);
        }
    }
    if (($login_content->find('input[name=csrfToken]', 0))) {
        foreach ($login_content->find('input[name=csrfToken]') as $element) {
            $csrfToken = trim($element->value);
        }
    }
    if (($login_content->find('input[name=sourceAlias]', 0))) {
        foreach ($login_content->find('input[name=sourceAlias]') as $element) {
            $sourceAlias = trim($element->value);
        }
    }
}
curl_setopt($ch, CURLOPT_URL, "https://www.linkedin.com/uas/login-submit");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'isJsEnabled='.$isJsEnabled.'&source_app='.$source_app.'&tryCount='.$tryCount.'&clickedSuggestion='.$clickedSuggestion.'&session_key='.$username.'&session_password='.$password.'&session_redirect='.$session_redirect.'&trk='.$trk.'&loginCsrfParam='.$loginCsrfParam.'&fromEmail='.$fromEmail.'&csrfToken='.$csrfToken.'&sourceAlias='.$sourceAlias);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$store = curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, 'https://www.linkedin.com/company/facebook');
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_POSTFIELDS, "");
$content = curl_exec($ch);
curl_close($ch);
echo $content;
?>
Here is a solution for the login. To verify that it works, save the response content to a file and you will see that the login was successful.
The code below uses a small fetch_value helper instead of simple_html_dom, but you can still use simple_html_dom if you prefer.
<?php
// Return the text found between $find_start and $find_end in $str,
// or '' when $find_start is empty or not present.
function fetch_value($str, $find_start = '', $find_end = '')
{
    if ($find_start == '')
    {
        return '';
    }
    $start = strpos($str, $find_start);
    if ($start === false)
    {
        return '';
    }
    $length = strlen($find_start);
    $substr = substr($str, $start + $length);
    if ($find_end == '')
    {
        return $substr;
    }
    $end = strpos($substr, $find_end);
    if ($end === false)
    {
        return $substr;
    }
    return substr($substr, 0, $end);
}
$linkedin_login_page = "https://www.linkedin.com/uas/login";
$linkedin_ref = "https://www.linkedin.com";
$username = 'username';
$password = 'password';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $linkedin_login_page);
curl_setopt($ch, CURLOPT_REFERER, $linkedin_ref);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
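// NOTE: disabling peer verification skips the certificate check and is insecure;
// set this to true when a CA bundle is available.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);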
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
$login_content = curl_exec($ch);
if(curl_error($ch)) {
echo 'error:' . curl_error($ch);
}
$var = array(
'isJsEnabled' => 'false',
'source_app' => '',
'clickedSuggestion' => 'false',
'session_key' => trim($username),
'session_password' => trim($password),
'signin' => 'Sign In',
'session_redirect' => '',
'trk' => '',
'fromEmail' => '');
$var['loginCsrfParam'] = fetch_value($login_content, 'type="hidden" name="loginCsrfParam" value="', '"');
$var['csrfToken'] = fetch_value($login_content, 'type="hidden" name="csrfToken" value="', '"');
$var['sourceAlias'] = fetch_value($login_content, 'input type="hidden" name="sourceAlias" value="', '"');
$post_array = array();
foreach ($var as $key => $value)
{
    $post_array[] = urlencode($key) . '=' . urlencode($value);
}
$post_string = implode('&', $post_array);
curl_setopt($ch, CURLOPT_URL, "https://www.linkedin.com/uas/login-submit");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
$store = curl_exec($ch);
if (stripos($store, "session_password-login-error") !== false) {
    // The login form came back with an error message.
    $err = trim(strip_tags(fetch_value($store, '<span class="error" id="session_password-login-error">', '</span>')));
    echo "Login error : " . $err;
} elseif (stripos($store, 'profile-nav-item') !== false) {
    // Logged in: switch the handle back to GET and fetch the company page.
    curl_setopt($ch, CURLOPT_URL, 'https://www.linkedin.com/company-beta/10667/?pathWildcard=10667');
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $content = curl_exec($ch);
    echo $content;
} else {
    echo "unknown error";
}
// Release the handle on every path, not only after a successful login.
curl_close($ch);
?>
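If you would rather not depend on the exact attribute order that fetch_value matches against, here is a minimal sketch of another way to collect the hidden form fields and build the POST body. It assumes the fields are ordinary hidden input elements. The function name extract_hidden_inputs and the sample markup are made up for illustration; DOMDocument, DOMXPath and http_build_query are standard PHP.

```php
<?php
// Hypothetical helper (not part of the answer above): collect every hidden
// <input> name/value pair from an HTML page with PHP's built-in DOM extension.
function extract_hidden_inputs(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = trim($input->getAttribute('value'));
    }
    return $fields;
}

// Stand-in markup for illustration only; the real login page will differ.
$sample = '<form><input type="hidden" name="csrfToken" value="ajax:123"/>'
        . '<input type="hidden" name="loginCsrfParam" value=" abc-def "/>'
        . '<input type="text" name="session_key"/></form>';

$fields = extract_hidden_inputs($sample);
// Placeholder credentials, assumed for the example.
$fields['session_key'] = 'user@example.com';
$fields['session_password'] = 'p&ss word';

// http_build_query() URL-encodes keys and values, replacing the manual loop.
$post_string = http_build_query($fields);
echo $post_string, "\n";
```

In the script above you would pass $login_content to the helper and hand the resulting string to CURLOPT_POSTFIELDS. Whether the live page still exposes these field names is something to check against the actual response.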
You will notice that the company page doesn't load, because LinkedIn has just changed its design and its company page URLs in order to track which company pages are opened.
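To see what the server actually returned, for the login or for the company page, you can dump the response to a file and open it in a browser. This is only a rough sketch: save_response and the file name are hypothetical, and the $store string here is a stand-in for the value returned by curl_exec.

```php
<?php
// Hypothetical helper: write a response body to a file for manual inspection.
// Returns true when the file was written.
function save_response(string $body, string $path): bool
{
    return file_put_contents($path, $body) !== false;
}

// Stand-in for the result of curl_exec($ch); in the real script use $store or $content.
$store = '<html><body>placeholder response</body></html>';

// Assumed location: the system temp directory, so the sketch runs anywhere.
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'linkedin_login_response.html';

if (save_response($store, $path)) {
    echo 'Saved ', strlen($store), ' bytes to ', $path, "\n";
} else {
    echo 'Could not write ', $path, "\n";
}
```

Calling curl_getinfo($ch, CURLINFO_HTTP_CODE) right after curl_exec is another quick check, since it reports the HTTP status of the last request.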