I have inherited a web spider application, with all source code. It appears that for normal brochure style websites (say under 15 pages), the software runs perfectly fine.
For larger sites (over 20 or so pages), the software throws a StackOverflowException on the line marked in the code below.
It does not appear to be utilizing recursion, and unfortunately, there is no support for the LinqToHtml (SuperStarCoders) library being used.
Here is the code that is running when the exception occurs:
Private Function ExportXml(Optional ByVal _Worker As ComponentModel.BackgroundWorker = Nothing) As Boolean
Dim _L = PopulateSEOList(_Worker)
Try
Dim _TmpStr As New Text.StringBuilder
Dim _X As New XDocument, _ct As Long = 0, _Elements As Typing.SEO.Elements = Nothing
ReportProgress(0, _Worker)
With _TmpStr
.Append("<?xml version=""1.0"" encoding=""UTF-8""?>")
.Append("<o7th.Web.Design.Web.Spider>")
For i As Long = 0 To _L.Count - 1
_ct += 1
.Append(" <Page>")
.Append(" <Link>" & XmlEscape(_L(i).Link) & "</Link>")
.Append(" <Title>" & XmlEscape(_L(i).Title) & "</Title>")
.Append(" <Keywords>" & XmlEscape(_L(i).Keywords) & "</Keywords>")
.Append(" <Description>" & XmlEscape(_L(i).Description) & "</Description>")
.Append(" <Elements>")
_Elements = _L(i).ContentElements
If _Elements IsNot Nothing Then
If _Elements.H1 IsNot Nothing Then
.Append(<H1>
<%= (From n In _Elements.H1.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</H1>)
End If
If _Elements.H2 IsNot Nothing Then
.Append(<H2>
<%= (From n In _Elements.H2.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</H2>)
End If
If _Elements.H3 IsNot Nothing Then
.Append(<H3>
<%= (From n In _Elements.H3.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</H3>)
End If
If _Elements.H4 IsNot Nothing Then
.Append(<H4>
<%= (From n In _Elements.H4.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</H4>)
End If
If _Elements.H5 IsNot Nothing Then
.Append(<H5>
<%= (From n In _Elements.H5.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</H5>)
End If
If _Elements.H6 IsNot Nothing Then
.Append(<H6>
<%= (From n In _Elements.H6.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</H6>)
End If
If _Elements.UL IsNot Nothing Then
.Append(<UL>
<%= (From n In _Elements.UL.AsParallel()
Select
<Content><%= ConvertToCDATA(n) %></Content>).ToList() %>
</UL>)
End If
If _Elements.OL IsNot Nothing Then
.Append(<OL>
<%= (From n In _Elements.OL.AsParallel()
Select
<Content><%= ConvertToCDATA(n) %></Content>).ToList() %>
</OL>)
End If
If _Elements.STRONG IsNot Nothing Then
.Append(<STRONG>
<%= (From n In _Elements.STRONG.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</STRONG>)
End If
If _Elements.EM IsNot Nothing Then
.Append(<EM>
<%= (From n In _Elements.EM.AsParallel()
Select
<Content><%= XmlEscape(n) %></Content>).ToList() %>
</EM>)
End If
If _Elements.BLOCKQUOTE IsNot Nothing Then
.Append(<BLOCKQUOTE>
<%= (From n In _Elements.BLOCKQUOTE.AsParallel()
Select
<Content><%= ConvertToCDATA(n) %></Content>).ToList() %>
</BLOCKQUOTE>)
End If
If _Elements.A IsNot Nothing Then
.Append(<LINKS>
<%= (From n In _Elements.A.AsParallel()
Select
<Content>
<HREF><%= XmlEscape(n.Href) %></HREF>
<REL><%= XmlEscape(n.Rel) %></REL>
<TITLE><%= XmlEscape(n.Title) %></TITLE>
<TARGET><%= XmlEscape(n.Target) %></TARGET>
<CONTENT><%= XmlEscape(n.Content) %></CONTENT>
</Content>).ToList() %>
</LINKS>)
End If
If _Elements.IMG IsNot Nothing Then
.Append(<IMAGES>
<%= (From n In _Elements.IMG.AsParallel()
Select
<Content>
<SRC><%= XmlEscape(n.Source) %></SRC>
<ALT><%= XmlEscape(n.Alt) %></ALT>
<TITLE><%= XmlEscape(n.Title) %></TITLE>
</Content>).ToList() %>
</IMAGES>)
End If
End If
.Append(" </Elements>")
.Append(" <Content><![CDATA[" & _L(i).Content.ToString() & "]]></Content>")
.Append(" </Page>")
ReportProgress((_ct / _L.Count) * 100, _Worker)
Next
.Append("</o7th.Web.Design.Web.Spider>")
End With
Dim _xStr As String = _TmpStr.ToString()
_X = XDocument.Parse(_xStr)
_X.Save(ExportPath & "site.xml")
_X = Nothing
ReportProgress(100, _Worker)
Return True
Catch ex As Exception
'Put logging in here
Message = ex.Message & ":::Export.ExportXml"
Return False
End Try
End Function
The LinkList variable above is a list(of Typing.Links):
    Partial Public Class Links
        Public Property SiteUrl As String
        Public Property SiteTitle As String
        Public Property Site As String
    End Class
The other 2 lists are:
Imports Superstar.Html.Linq
Public Class Typing
Partial Public Class SEO
Public Property Link As String
Public Property Title As String
Public Property Description As String
Public Property Keywords As String
Public Property Content As HElement
Public Property ContentElements As Elements
Partial Public Class Elements
Public Property H1 As List(Of String)
Public Property H2 As List(Of String)
Public Property H3 As List(Of String)
Public Property H4 As List(Of String)
Public Property H5 As List(Of String)
Public Property H6 As List(Of String)
Public Property UL As List(Of String)
Public Property OL As List(Of String)
Public Property STRONG As List(Of String)
Public Property BLOCKQUOTE As List(Of String)
Public Property EM As List(Of String)
Public Property A As List(Of Links)
Public Property IMG As List(Of Images)
Partial Public Class Images
Public Property Source As String
Public Property Alt As String
Public Property Title As String
End Class
Partial Public Class Links
Public Property Href As String
Public Property Rel As String
Public Property Title As String
Public Property Target As String
Public Property Content As String
End Class
End Class
End Class
End Class
ReportProgress simply reports to the BackgroundWorker of the XAML window, in this particular case to update a progress bar:
    Public Sub ReportProgress(ByVal ct As Integer, _Worker As ComponentModel.BackgroundWorker)
        If _Worker IsNot Nothing Then
            _Worker.ReportProgress(ct)
            Threading.Thread.Sleep(500)
        End If
    End Sub
And the Downloader class is:
Imports System.Reflection
Imports System.Net
Imports Superstar.Html.Linq
Public Class Downloader
Implements IDisposable
''' <summary>
''' Get the returned downloaded string
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property ReturnString As String
Get
Return _StrReturn
End Get
End Property
Private Property _StrReturn As String
''' <summary>
''' Get the returned downloaded byte array
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property ReturnBytes As Byte()
Get
Return _FSReturn
End Get
End Property
Private Property _FSReturn As Byte()
Private Property _UserAgent As String = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"
Private Property DataReceived As Boolean = False
''' <summary>
''' Download a string, but do not block the calling thread
''' </summary>
''' <param name="_Path"></param>
''' <remarks></remarks>
    Public Sub DownloadString(ByVal _Path As String, Optional ByVal _Worker As ComponentModel.BackgroundWorker = Nothing)
        SetAllowUnsafeHeaderParsing20()
        Using wc As New Net.WebClient()
            With wc
                Dim _ct As Long = 0
                DataReceived = False
                .Headers.Add("user-agent", _UserAgent)
                .DownloadStringAsync(New System.Uri(_Path))
                AddHandler .DownloadStringCompleted, AddressOf StringDownloaded
                Do While Not DataReceived
                    If _Worker IsNot Nothing Then
                        _ct += 1
                        ReportProgress(_ct, _Worker)
                    End If
                Loop
            End With
        End Using
    End Sub
''' <summary>
''' Download a file, but do not block the calling thread
''' </summary>
''' <param name="_Path"></param>
''' <remarks></remarks>
Public Sub DownloadFile(ByVal _Path As String, Optional ByVal _Worker As ComponentModel.BackgroundWorker = Nothing)
SetAllowUnsafeHeaderParsing20()
Using wc As New Net.WebClient()
With wc
Dim _ct As Long = 0
DataReceived = False
.Headers.Add("user-agent", _UserAgent)
.DownloadDataAsync(New System.Uri(_Path))
AddHandler .DownloadDataCompleted, AddressOf FileStreamDownload
Do While Not DataReceived
If _Worker IsNot Nothing Then
_ct += 1
ReportProgress(_ct, _Worker)
End If
Loop
End With
End Using
End Sub
''' <summary>
''' Download a parsable HDocument, for using HtmlToLinq
''' </summary>
''' <param name="_Path"></param>
''' <returns></returns>
''' <remarks></remarks>
    Public Function DownloadHDoc(ByVal _Path As String, Optional ByVal _Worker As ComponentModel.BackgroundWorker = Nothing) As HDocument
        Try
            'StackOverFlowException Occurring Here!
            DownloadString(_Path, _Worker)
            Return HDocument.Parse(_StrReturn)
        Catch soex As StackOverflowException
            'put some logging in here, with the path attempted
            Return Nothing
        Catch ex As Exception
            SetAllowUnsafeHeaderParsing20()
            Return HDocument.Load(_Path)
        End Try
    End Function
#Region "Internals"
Private Sub SetAllowUnsafeHeaderParsing20()
Dim a As New System.Net.Configuration.SettingsSection
Dim aNetAssembly As System.Reflection.Assembly = Assembly.GetAssembly(a.GetType)
Dim aSettingsType As Type = aNetAssembly.GetType("System.Net.Configuration.SettingsSectionInternal")
Dim args As Object() = Nothing
Dim anInstance As Object = aSettingsType.InvokeMember("Section", BindingFlags.Static Or BindingFlags.GetProperty Or BindingFlags.NonPublic, Nothing, Nothing, args)
Dim aUseUnsafeHeaderParsing As FieldInfo = aSettingsType.GetField("useUnsafeHeaderParsing", BindingFlags.NonPublic Or BindingFlags.Instance)
aUseUnsafeHeaderParsing.SetValue(anInstance, True)
End Sub
Private Sub FileStreamDownload(ByVal sender As Object, ByVal e As DownloadDataCompletedEventArgs)
If e.Cancelled = False AndAlso e.Error Is Nothing Then
DataReceived = True
_FSReturn = DirectCast(e.Result, Byte())
Else
_FSReturn = Nothing
End If
End Sub
Private Sub StringDownloaded(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
If e.Cancelled = False AndAlso e.Error Is Nothing Then
DataReceived = True
_StrReturn = DirectCast(e.Result, String)
Else
_StrReturn = String.Empty
End If
End Sub
#End Region
#Region "IDisposable Support"
Private disposedValue As Boolean ' To detect redundant calls
' IDisposable
Protected Overridable Sub Dispose(disposing As Boolean)
If Not Me.disposedValue Then
If disposing Then
End If
_StrReturn = Nothing
_FSReturn = Nothing
End If
Me.disposedValue = True
End Sub
Public Sub Dispose() Implements IDisposable.Dispose
Dispose(True)
GC.SuppressFinalize(Me)
End Sub
#End Region
End Class
As I said above, it does not look like there is any recursion happening (at least none that truly sticks out at me), so I immediately assume that it is happening within HDocument.Parse.
Can you tell me where this is wrong, and how to correct the issue?
I have done some research, and understand that the default stack size is only 1MB, so I wonder if this is truly one of those special circumstances where I should attempt to increase this...
I found after watching the trace a number of times, that it always occurred when it hit a particular page. This page, just so happens to be over 500k in size.
Here is the Call Stack:
[External Code]
> o7th.Web.Design.Spider.Worker.dll!o7th.Web.Design.Spider.Worker.Downloader.DownloadHDoc(String _Path, System.ComponentModel.BackgroundWorker _Worker) Line 95 + 0x1e bytes Basic
o7th.Web.Design.Spider.Worker.dll!o7th.Web.Design.Spider.Worker.Export.PopulateSEOList(System.ComponentModel.BackgroundWorker _Worker) Line 513 + 0x65 bytes Basic
o7th.Web.Design.Spider.Worker.dll!o7th.Web.Design.Spider.Worker.Export.ExportXml(System.ComponentModel.BackgroundWorker _Worker) Line 70 + 0x1e bytes Basic
o7th.Web.Design.Spider.Worker.dll!o7th.Web.Design.Spider.Worker.Export.RunExport(System.ComponentModel.BackgroundWorker _Worker) Line 30 + 0x17 bytes Basic
o7th.Web.Design.WebSpider.exe!o7th.Web.Design.WebSpider.ParseLinks.RunExport(Object sender, System.ComponentModel.DoWorkEventArgs e) Line 106 + 0x2c bytes Basic
[External Code]
And Locals shows me the page I mention above that is over 500k in size
(I needed more space otherwise I would have added this as a comment to @Jakub Konecki's post.)
I've built several spiders over the years, and the only big performance gain from parallelism is in the actual downloading of URLs. You might shave a couple of hundred milliseconds off HTML parsing on large documents, but the gain isn't worth the debugging price. So make your life easier and remove the parallelism.
You've also got a weird async blocking problem. In DownloadHDoc you're calling DownloadString synchronously, but then inside of DownloadString you're kicking off an async method and then blocking on a bit flag, thus defeating the purpose of the async. What's worse is that you're blocking in a do-while loop which is spinning at a million miles per hour and calling ReportProgress every time. I expect this is what's actually giving you the SOE. Putting a Thread.Sleep(100) in there might help you for starters.
[EDIT]
The code that is blocking on the bit flag is this:
    .DownloadStringAsync(New System.Uri(_Path))
    AddHandler .DownloadStringCompleted, AddressOf StringDownloaded
    Do While Not DataReceived
        If _Worker IsNot Nothing Then
            _ct += 1
            ReportProgress(_ct, _Worker)
        End If
    Loop
Line 1 kicks off an async method, line 2 adds a handler for the completion and returns immediately. Line 3 is checking a global variable over and over and over, waiting for the function StringDownloaded to set it. This is happening hundreds or thousands (or more) of times every second. Although not optimal on its own, what makes it bad is that you are calling the ReportProgress method every time. The larger the document, the more calls to ReportProgress will be made. You really only need to update the UI every 100ms at most; I usually set mine to every 250ms or 500ms.
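One way to cap the update rate without touching the loop itself is to throttle the reporting. The following is only a minimal sketch: ThrottledProgress, Report and intervalMs are hypothetical names that do not exist in the original code, and it assumes the BackgroundWorker has WorkerReportsProgress set to True (ReportProgress throws otherwise).

```vbnet
Imports System
Imports System.ComponentModel
Imports System.Diagnostics

'Hypothetical helper (not part of the original code): forwards a progress
'value to the worker only if at least intervalMs have passed since the
'last forwarded value, and silently drops everything in between.
Public Class ThrottledProgress
    Private ReadOnly _worker As BackgroundWorker
    Private ReadOnly _intervalMs As Long
    Private ReadOnly _clock As Stopwatch = Stopwatch.StartNew()
    Private _lastReportMs As Long

    Public Sub New(ByVal worker As BackgroundWorker, Optional ByVal intervalMs As Long = 250)
        _worker = worker
        _intervalMs = intervalMs
        'Start "one interval ago" so the very first call is forwarded
        _lastReportMs = -intervalMs
    End Sub

    Public Sub Report(ByVal percent As Integer)
        If _worker Is Nothing Then Return
        Dim nowMs As Long = _clock.ElapsedMilliseconds
        'Too soon since the last update: skip this one
        If nowMs - _lastReportMs < _intervalMs Then Return
        _lastReportMs = nowMs
        _worker.ReportProgress(percent)
    End Sub
End Class
```

With something like this, the loop can call Report as often as it likes and the UI still sees at most one update per interval. It does not stop the loop from burning CPU while it spins, so it complements a sleep or a proper wait rather than replacing one.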
[EDIT 2]
If the above was the problem you should be able to change it to something like:
    .DownloadStringAsync(New System.Uri(_Path))
    AddHandler .DownloadStringCompleted, AddressOf StringDownloaded
    Do While Not DataReceived
        If _Worker IsNot Nothing Then
            _ct += 1
            ReportProgress(_ct, _Worker)
        End If
        Threading.Thread.Sleep(250) 'Sleep inside of the loop
    Loop
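If sleeping in the loop does help, a possible next step is to stop polling the flag altogether and wait on a signal instead. This is a rough sketch under assumptions, not a drop-in replacement: SignalledDownload and DownloadStringBlocking are hypothetical names, and it assumes the call is made from a worker thread (as in your call stack), because blocking the UI thread this way would prevent the completion event from ever being delivered. It also attaches the handler before starting the download and signals on failure as well as success, so an error cannot leave the caller waiting forever.

```vbnet
Imports System
Imports System.ComponentModel
Imports System.Net
Imports System.Threading

'Hypothetical helper (not part of the original Downloader class).
Public Module SignalledDownload
    Public Function DownloadStringBlocking(ByVal path As String,
                                           ByVal userAgent As String,
                                           Optional ByVal worker As BackgroundWorker = Nothing) As String
        Dim result As String = Nothing
        Dim failure As Exception = Nothing
        'Deliberately not wrapped in Using, so a late callback can never
        'hit a disposed event
        Dim done As New ManualResetEventSlim(False)

        Using wc As New WebClient()
            wc.Headers.Add("user-agent", userAgent)
            'Attach the handler BEFORE starting the download
            AddHandler wc.DownloadStringCompleted,
                Sub(sender As Object, e As DownloadStringCompletedEventArgs)
                    If e.Cancelled Then
                        failure = New OperationCanceledException()
                    ElseIf e.Error IsNot Nothing Then
                        failure = e.Error
                    Else
                        result = e.Result
                    End If
                    'Signal on success AND on failure
                    done.Set()
                End Sub
            wc.DownloadStringAsync(New Uri(path))

            'Wait returns False on each 250ms timeout, True once signalled
            Dim ticks As Integer = 0
            Do While Not done.Wait(250)
                If worker IsNot Nothing Then
                    ticks += 1
                    worker.ReportProgress(Math.Min(ticks, 100))
                End If
            Loop
        End Using

        If failure IsNot Nothing Then
            Throw New Exception("Download failed: " & path, failure)
        End If
        Return result
    End Function
End Module
```

A call might look like `Dim html As String = DownloadStringBlocking(_Path, _UserAgent, _Worker)`. The thread is parked between progress updates instead of spinning, and ReportProgress is called at most four times a second. Whether this alone removes the StackOverflowException depends on whether the spinning loop really is the cause, which I have not verified.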