Tags: azure-data-lake, u-sql, extractor

OutOfMemory on custom extractor


I have stitched a lot of small XML files into one file and then written a custom extractor that returns one row per original file, each containing a byte array with that file's XML.

  1. Run on remote/master
    • Running it for one file (gzipped, 11 MB) works fine.
    • Running it for more than one file throws a System.OutOfMemoryException.
  2. Run on local/master
    • Running it for one or more files (gzipped, 500+ MB) works fine.

The extractor looks like this:

public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
    using (var stream = new StreamReader(input.BaseStream))
    {
        var xml = stream.ReadToEnd();

        // Clean stitched XML
        xml = UtilsXml.CleanXml(xml);

        // Get nodes - one for each stitched file
        var d = new XmlDocument();
        d.LoadXml(xml);
        var root = d.FirstChild;

        for (int i = 0; i < root.ChildNodes.Count; i++)
        {
            output.Set<object>(1, Encoding.ASCII.GetBytes(root.ChildNodes[i].OuterXml));
            yield return output.AsReadOnly();
        }
    }
}

and the error message looks like this:

==== Caught exception System.OutOfMemoryException

at System.Xml.XmlDocument.CreateTextNode(String text)
at System.Xml.XmlLoader.LoadAttributeNode()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlDocument.Load(XmlReader reader)
at System.Xml.XmlDocument.LoadXml(String xml)
at Microsoft.Analytics.Tools.Formats.Text.XmlByteArrayRowExtractor.<Extract>d__0.MoveNext()
at ScopeEngine.SqlIpExtractor<ScopeEngine::GZipInput,Extract_0_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::GZipInput\,Extract_0_Data0>* , Extract_0_Data0* output) in d:\data\ccs\jobs\bc367467-ef86-43d2-a937-46ba2d4cc524_v0\sqlmanaged.h:line 1924

So what am I doing wrong? And how do I debug this on remote?

Thanks!


Solution

Unfortunately, the local run does not enforce memory limits, so you would have to check memory usage in local vertex debug yourself.

Looking at your code above, I see that you are loading the XML documents into a DOM. Please note that an XML DOM can explode the data size from the string representation by a factor of 10 or more (I have seen 2 to 12 in my time as the resident SQL XML guru).
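
To get a feel for the blow-up on your own data, you could measure it with a rough sketch like the one below (my illustration, not official guidance; GC.GetTotalMemory is only approximate and the factor varies with document shape):

using System;
using System.Text;
using System.Xml;

class DomSizeDemo
{
    static void Main()
    {
        // Build a sample "stitched" document with many small elements.
        var sb = new StringBuilder("<root>");
        for (int i = 0; i < 100000; i++)
            sb.Append("<item id=\"").Append(i).Append("\">payload</item>");
        sb.Append("</root>");
        string xml = sb.ToString();

        // Compare managed heap usage before and after building the DOM.
        long before = GC.GetTotalMemory(true);
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        long after = GC.GetTotalMemory(true);

        Console.WriteLine($"string: ~{xml.Length * 2:N0} bytes (UTF-16)");
        Console.WriteLine($"DOM:    ~{after - before:N0} bytes extra");
        GC.KeepAlive(doc); // keep the DOM alive through the second measurement
    }
}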

Each UDO today only gets 1/2 GB of RAM to play with. So my assumption is that your XML DOM document(s) grow beyond that limit.

The normal recommendation is to use the XmlReader interface (there is a reader-based extractor in the samples on http://usql.io as well) and scan through the document(s) for the information you are looking for; a streaming sketch follows below.
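
Here is a minimal sketch of that streaming approach (my sketch, not the official sample; it omits your UtilsXml.CleanXml step, which you would need to replicate in a streaming-friendly way, and assumes column 1 is your byte-array column as in your code):

using System.Collections.Generic;
using System.Text;
using System.Xml;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class StreamingXmlExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (var reader = XmlReader.Create(input.BaseStream, settings))
        {
            reader.MoveToContent();    // position on the root element
            reader.ReadStartElement(); // step inside the root

            while (reader.NodeType == XmlNodeType.Element)
            {
                // ReadOuterXml consumes one child element and advances the
                // reader, so only one file's XML is in memory at a time.
                string childXml = reader.ReadOuterXml();
                output.Set<byte[]>(1, Encoding.ASCII.GetBytes(childXml));
                yield return output.AsReadOnly();
            }
        }
    }
}

This never materializes the whole document, so memory stays proportional to the largest single stitched file rather than to the entire input.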

If your documents are always small enough (e.g., under 20 MB), you may want to make sure that you release the memory of the other documents and operate on one document at a time, as in the variation below.
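
If you go that route, a hedged variation on the loop in the sketch above would be to build one small DOM per child and let it go out of scope before the next one (this replaces the while loop in that sketch; oneDoc is my illustrative name):

while (reader.NodeType == XmlNodeType.Element)
{
    var oneDoc = new XmlDocument();
    using (var subtree = reader.ReadSubtree())
    {
        oneDoc.Load(subtree); // DOM for this one child only
    }
    reader.Read(); // ReadSubtree leaves the reader on the child's end element

    output.Set<byte[]>(1, Encoding.ASCII.GetBytes(oneDoc.OuterXml));
    yield return output.AsReadOnly();
    // oneDoc becomes unreachable here and can be collected before the next child
}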

We do have plans to allow you to annotate your UDO with its memory needs, but that is still a bit out.