Tags: xml, azure, azure-data-lake, u-sql

E_RUNTIME_USER_STRINGTOOBIG XML parsing


While loading XML through Azure Data Lake Analytics jobs, I ran into unavoidable E_RUNTIME_USER_STRINGTOOBIG errors using Microsoft.Analytics.Samples.Formats.

The thing is that XmlDomExtractor is the only way to join (or rather, apply) elements from within the same XML file (one-to-many).

@xml =
EXTRACT Id string,
        //...
        Products string
FROM @"/pathToXml.xml"
USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor("contract",
      new SQL.MAP<string, string>{
      {"id", "Id"},
      //...
      // this is actually a node that may contain thousands of child nodes
      {"products", "Products"}
      });

Apply:

@data = SELECT c.Id,
        //....
        pp.ProductSid, 
        // ... other product properties
    FROM @prepareData AS c
    OUTER APPLY
        new Microsoft.Analytics.Samples.Formats.Xml.XmlApplier ("Products","/",new SQL.MAP<string, string>{
      {"/abc/p/s", "ProductSid"},
      //....
      }) AS pp(
            ProductSid string, 
            /*....*/
              );

The full version of the code is here.

I've tried to minimize my XML nodes by replacing names with single letters. Unfortunately, it didn't help: the thousands of items inside (plus the long product names) broke through the limitation anyway.


Solution


There is no actual solution for this issue so far, but I've found two ways to work around it:

The first is to minimize your XML or JSON. In the case of XML, check the namespaces; getting rid of them may help (in my case it fixed most of the 'too long' values).
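For example, a namespace-stripping pre-processing step (a sketch in LINQ to XML, not part of the original post; the class and file names are illustrative) could look like this:

using System.Linq;
using System.Xml.Linq;

public static class XmlMinifier
{
    // Removes namespace prefixes and xmlns declarations from every element,
    // which shortens the serialized node strings passed around in U-SQL.
    public static XElement StripNamespaces(XElement element)
    {
        foreach (var e in element.DescendantsAndSelf())
        {
            // Keep only the local name, dropping the namespace part.
            e.Name = e.Name.LocalName;
            // Drop xmlns declarations and de-qualify the remaining attributes.
            e.ReplaceAttributes(e.Attributes()
                .Where(a => !a.IsNamespaceDeclaration)
                .Select(a => new XAttribute(a.Name.LocalName, a.Value)));
        }
        return element;
    }
}

Usage would be something like XmlMinifier.StripNamespaces(XElement.Load("contract.xml")).Save("contract.clean.xml") before uploading the files to the lake.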

The second is to write your own extractor (which is not that hard).
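To show that it really isn't that hard, here is a minimal sketch of such an extractor (not the parser linked below). It streams the file with XmlReader and emits one row per product, so the products subtree never has to fit into a single string column; the element and attribute names ("contract", "id", "s") and the absence of namespaces are assumptions taken from the sample script above.

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class ProductXmlExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = XmlReader.Create(input.BaseStream))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "contract")
                {
                    // Materialize one contract at a time; XNode.ReadFrom also
                    // advances the reader past this element.
                    var contract = (XElement)XNode.ReadFrom(reader);
                    var id = (string)contract.Attribute("id");

                    // One output row per product SID instead of one huge string.
                    foreach (var sid in contract.Descendants("s"))
                    {
                        output.Set("Id", id);
                        output.Set("ProductSid", (string)sid);
                        yield return output.AsReadOnly();
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

In the script it would replace the XmlDomExtractor/XmlApplier pair, e.g. EXTRACT Id string, ProductSid string FROM @"/pathToXml.xml" USING new ProductXmlExtractor(); (the column names must match what the extractor sets).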

I used both approaches, and here is the implementation of my own parser. Here is a usage example. Please feel free to use it.

It's pretty generic and supports XPath and namespaces as well. In terms of performance, it's pretty similar to Microsoft's implementation.

Performance

During my investigation of this issue, I found that both parsers work extremely slowly when processing many small files.

However, making the files too big (~2 GB) causes an OutOfMemory exception. My assumption is that this happens because the files are processed atomically (a file isn't split into 250 MB chunks that are processed independently, the way CSV or TSV files are). In my case, a file size of 200-250 MB turned out to be optimal.
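For a custom extractor, this atomic behavior is what the SqlUserDefinedExtractor attribute controls. A bare-bones illustration (the class is a placeholder, not real parsing code):

using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = true: the whole file is handled by a single vertex,
// which XML/JSON need because they can't be split at arbitrary byte offsets,
// but a single very large file can then exhaust that vertex's memory.
// With AtomicFileProcessing = false the input may be split into extents that
// are processed in parallel, which is how the built-in CSV/TSV extractors scale.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class AtomicDemoExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        yield break; // skeleton only; no rows emitted
    }
}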