Tags: c#, asp.net, vb.net, web-scraping, gridview

How do I scrape information off ASP.NET websites when paging and JavaScript links are being used?


I have been given a staff list which is supposed to be up to date, but it doesn't match an intranet People Finder, which is written in ASP.NET.

As the information is sensitive, I am not able to access the database the People Finder uses, so the only way I can get at the information is by scraping the structure, starting with the top brass and then working down through each tier in turn.

Each person has a staff number, which forms the URL http://intranet/peoplefinder/index.aspx?srn=ABC1234. The people who report to them are listed underneath as links in the format <a id="gvEmployees_ctl03_lnkFullName" href="index.aspx?srn=ABC4321" target="_self">, where the srn value in each href is the reportee's staff number and the link leads to that person's own team.

The trouble arises when a team is big, because the GridView then pages its results, and the pager uses JavaScript links such as <a href="javascript:__doPostBack('gvEmployees','Page$2')">2</a> rather than ordinary URLs.

How would I scrape this page, capturing the SRN and other details of the person along with everyone who reports to them on all pages of the GridView, and then repeat the same process for each reportee until the whole list is complete?

Example HTML of result

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head><title>
    People Finder: Name Surname
</title><link rel="stylesheet" href="/path/to/style.css" type="text/css" /><link rel="stylesheet" href="/path/to/anotherStyle.css" type="text/css" />
    <script type="text/javascript" src="/path/to/peoplefinder.js"></script>
</head>
<body>
    <form name="form1" method="post" action="/path/to/index.aspx" id="form1">
<div>
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="### ViewState ###" />
</div>

<script type="text/javascript">
<!--
var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
// -->
</script>


<script src="/path/to/WebResource.axd?d=AueXWrgAf8xSxMTAt1Q4AA2&amp;t=633311832634916698" type="text/javascript"></script>

        <div class="HP3CHeader">
            <div id="LWHPBanner">
                <h1><span id="lblName">Name Surname</span></h1>
            </div>
        </div>

        <div id='CPMain'>
            <div id="mainBox">

            <div id="pnlEmployeeDetails">

                <div id='basicData'>
                    <img id="imgPhoto" class="photo" src="/path/to/photo.jpg" style="height:69px;width:69px;border-width:0px;" />
                    <span id="lblBusinessUnit">Business Unit</span>
                    <span id="lblCostCentreName">Cost Centre</span>
                    <span id="lblLocation">Location</span>

                    <a href='/path/to/checkcontactdetails.htm' target='_blank' onclick='return OpenCheckContactDetails();' >Find out how to change your details/photo.</a>
                    <div id="manager">
        <strong>Reports to: </strong><a id="hlManager" href="/path/to/index.aspx?srn=ABC1234">Name Surname</a>
    </div>
                </div>

                <div id='contactData'>

                    <div id="pnlSrn">
        <strong>Staff number:</strong> <span id="lblSrn">ABC1234</span>
    </div>


                    <div id="pnlEmailAddress">
        <strong>Email Address:</strong> <span id="lblEmailAddress">Email</span>
    </div>
                    <div style="clear: both"></div>
                </div>

</div>

            <div id="pnlGrid">

                <h3><span id="lblGridTitle">Name's team</span></h3>
            <div>
        <table class="subordinates" cellspacing="0" cellpadding="2" rules="cols" border="1" id="gvEmployees" style="border-style:None;border-collapse:collapse;">
            <tr style="color:Black;background-color:#EFF3FB;border-style:None;font-weight:bold;">
                <th scope="col"><a href="javascript:__doPostBack('gvEmployees','Sort$SRN')" style="color:Black;">SRN</a></th><th scope="col"><a href="javascript:__doPostBack('gvEmployees','Sort$FullName')" style="color:Black;">Full name</a></th><th scope="col"><a href="javascript:__doPostBack('gvEmployees','Sort$RACFID')" style="color:Black;">RACFID</a></th>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl02_lnkFullName" href="index.aspx?srn=1K5932" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl03_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl04_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl05_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl06_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl07_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl08_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl09_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:White;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl10_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="reports" style="background-color:#EFF3FB;border-style:None;">
                <td style="width:70px;">ABC1234</td><td>
                            <a id="gvEmployees_ctl11_lnkFullName" href="/path/to/index.aspx?srn=ABC1234" target="_self">Name Surname</a> 
                        </td><td>ABCD</td>
            </tr><tr class="PagerStyle" style="color:#000039;border-style:None;">
                <td colspan="3"><table border="0">
                    <tr>
                        <td><span>1</span></td><td><a href="javascript:__doPostBack('gvEmployees','Page$2')" style="color:#000039;">2</a></td>
                    </tr>
                </table></td>
            </tr>
        </table>
    </div>

</div>
            </div>

            <div id="searchBox">
                <strong>Search People Finder:</strong>
                <br /><br />
                <span>Forename:</span><br/>
                <span><input name="txtFirstname" type="text" id="txtFirstname" /></span><br/>
                <span>Surname:</span><br/>
                <span><input name="txtSurname" type="text" id="txtSurname" /></span><br/>
                <span>RACFID:</span><br/>
                <span><input name="txtRacfid" type="text" id="txtRacfid" /></span><br/>
                <span>Staff number:</span><br/>
                <span><input name="txtSrn" type="text" id="txtSrn" /></span><br/>
                <div class="searchBoxItem" style="text-align:center;width:100%"><input type="submit" name="btnFind" value="Search" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;btnFind&quot;, &quot;&quot;, false, &quot;&quot;, &quot;index.aspx&quot;, false, false))" id="btnFind" title="Search for employees member" class="button" style="border-style:Outset;" /></div><br/> 
                <div>People Finder searches only UK staff.</div> 
               <!-- <div><a class="execBoardLink" href="/path/to/index.aspx?srn=ABC1234">Show Executive Board</a></div> -->
                <div style="margin-top:5px;"><a href="/path/to/phonebook" target="phoneBook" onclick='return OpenPhonebook();' title="Open Phonebook in new window">Open Phonebook</a></div>
            </div>
        </div>

        <div class="contentFooter"  style="text-align:center;">
            <table width="100%" cellpadding="0" cellspacing="0" border="0" summary="Navigation layout table">
                <tr>
                    <td align="left"><span class="linkArrow">&lt;</span> <a href="javascript:history.back();">Back</a></td>
                    <td align="center"></td>
                    <td align="right"><span class="linkArrow">^ </span><a href="#top">Top</a></td>
                </tr>
            </table>
        </div> 

<div>

    <input type="hidden" name="__PREVIOUSPAGE" id="__PREVIOUSPAGE" value="vy066Txz34y1E515UsTSTDabHKEmdBRCsq7xM0lpJls1" />
    <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWCgKM3uTTAgLP/83pDwLfwaTTAQKNguzjCAKt98LeCwLZh62pDwKKqdGpBwLd2q7jAwKa+5aMBAL5zb65C42zY4GBEUKujhjtZ/hZ8sLESfiF" />
</div></form>
</body>
</html>

Solution

  • You could POST the paging values to the page yourself from code, which is essentially what the browser does when a pager link is clicked.

    // Requires: using System.IO; using System.Net; using System.Text; using System.Web;
    string lcUrl = "http://www.mysite.com/page.aspx";

    HttpWebRequest loHttp = (HttpWebRequest) WebRequest.Create(lcUrl);

    // *** Send any POST data
    string lcPostData = "gvEmployees=" + HttpUtility.UrlEncode("Page$2");

    loHttp.Method = "POST";
    // Form posts need this content type, or the server will not parse the fields
    loHttp.ContentType = "application/x-www-form-urlencoded";

    byte[] lbPostBuffer = Encoding.GetEncoding(1252).GetBytes(lcPostData);
    loHttp.ContentLength = lbPostBuffer.Length;

    Stream loPostData = loHttp.GetRequestStream();
    loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
    loPostData.Close();

    // *** Read the response into a string
    HttpWebResponse loWebResponse = (HttpWebResponse) loHttp.GetResponse();
    Encoding enc = Encoding.GetEncoding(1252);
    StreamReader loResponseStream =
        new StreamReader(loWebResponse.GetResponseStream(), enc);

    string lcHtml = loResponseStream.ReadToEnd();

    loResponseStream.Close();
    loWebResponse.Close();


    Then parse out the data you need from the string.
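As a rough sketch of that parsing step, assuming the markup matches the example HTML in the question (report links whose ids end in lnkFullName, and pager links that call __doPostBack('gvEmployees','Page$N')), regular expressions are enough to pull out the staff numbers and the remaining page numbers. The names PeopleFinderParser, GetReportSrns and GetPageNumbers are made up for illustration; a real HTML parser would likely be more robust if the markup varies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PeopleFinderParser
{
    // Matches <a id="gvEmployees_ctlNN_lnkFullName" href="...index.aspx?srn=XXXX" ...>
    // and captures the srn value. Assumes id comes before href, as in the example page.
    private static readonly Regex ReportLink = new Regex(
        @"<a\s+id=""gvEmployees_ctl\d+_lnkFullName""\s+href=""[^""]*[?&]srn=([^""&]+)""",
        RegexOptions.IgnoreCase);

    // Matches __doPostBack('gvEmployees','Page$N') in the pager row and captures N.
    private static readonly Regex PagerLink = new Regex(
        @"__doPostBack\('gvEmployees','Page\$(\d+)'\)");

    // Staff numbers of everyone listed in the grid on this page of results.
    public static List<string> GetReportSrns(string html)
    {
        var srns = new List<string>();
        foreach (Match m in ReportLink.Matches(html))
        {
            srns.Add(m.Groups[1].Value);
        }
        return srns;
    }

    // Page numbers the pager links to (the current page is a plain <span>, so it is not included).
    public static List<int> GetPageNumbers(string html)
    {
        var pages = new List<int>();
        foreach (Match m in PagerLink.Matches(html))
        {
            pages.Add(int.Parse(m.Groups[1].Value));
        }
        return pages;
    }
}
```

Each SRN returned by GetReportSrns can be added to a queue of people still to visit, and each number from GetPageNumbers tells you which further postbacks to make for the current person.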

    --EDIT--

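    Here is what I would try (or something similar), where all of the post data the page expects is sent:

    string lcPostData =
        "__EVENTTARGET=" + HttpUtility.UrlEncode("gvEmployees") +
        "&__EVENTARGUMENT=" + HttpUtility.UrlEncode("Page$2") +
        "&__VIEWSTATE=" + HttpUtility.UrlEncode("<Value of __VIEWSTATE>") +
        // The example page also has this hidden field; it is probably needed too
        "&__EVENTVALIDATION=" + HttpUtility.UrlEncode("<Value of __EVENTVALIDATION>");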
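The hidden field values change with every response, so they have to be read from the page you just fetched rather than hard-coded. A minimal sketch of that, assuming the hidden inputs are rendered as in the question's HTML (type, then name, with value later in the tag): the class PostBackForm and its methods are hypothetical names, and WebUtility.UrlEncode from System.Net is used here in place of HttpUtility.UrlEncode only to avoid the System.Web reference.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class PostBackForm
{
    // Matches <input type="hidden" name="X" ... value="Y" /> and captures X and Y.
    private static readonly Regex HiddenInput = new Regex(
        @"<input\s+type=""hidden""\s+name=""([^""]+)""[^>]*?\svalue=""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Collects every hidden field on the page (__VIEWSTATE, __EVENTVALIDATION, ...).
    public static Dictionary<string, string> GetHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in HiddenInput.Matches(html))
        {
            // Attribute values are HTML-encoded in the page source.
            fields[m.Groups[1].Value] = WebUtility.HtmlDecode(m.Groups[2].Value);
        }
        return fields;
    }

    // Builds the form body that clicking the pager link for the given page would submit.
    public static string BuildPagePostBody(string html, string gridId, int page)
    {
        var fields = GetHiddenFields(html);
        fields["__EVENTTARGET"] = gridId;            // what __doPostBack sets
        fields["__EVENTARGUMENT"] = "Page$" + page;  // e.g. Page$2

        var body = new StringBuilder();
        foreach (var pair in fields)
        {
            if (body.Length > 0) body.Append('&');
            body.Append(WebUtility.UrlEncode(pair.Key))
                .Append('=')
                .Append(WebUtility.UrlEncode(pair.Value));
        }
        return body.ToString();
    }
}
```

The string returned by BuildPagePostBody(lcHtml, "gvEmployees", 2) can be used as lcPostData in the request code above, posted back to the same index.aspx?srn=... URL the page was fetched from.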