I have an application that needs to download a large number (>10k) of large XML files (8-10 MB each) over HTTP and extract some content from each one using a single XPath expression.
I'm wondering how this process can be optimized. These XML files will go directly into the Large Object Heap. I'm thinking about three options:
- Overall optimization: download the XML files using a separate I/O thread pool.
- Use streams to read the web response containing the XML file instead of reading it into a string, which would go to the LOH (not sure if it's possible or how to do that).
- Use Regex to retrieve the content from the XML, as the XPath is pretty simple and I don't need full DOM support for it.
Are there any other options?
There are lots of options for optimization, depending on what you want to maximize.
If your processing is faster than download (and it's hard to imagine that your XPath-based search will be slow), your limiting factor will be download speed. You can use asynchronous requests to download multiple files at a time, but if all the files are coming from the same server it's unlikely that more than a handful of concurrent downloads will give you any performance increase.
You could create an XmlReader from the stream while you're downloading, and (I think, although I'm not sure) run your XPath expression against the stream. But that doesn't really give you any benefit.
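For reference, here is a rough sketch of what the streaming approach could look like. It assumes the XPath is simple enough to be replaced by "every element with a given name whose content is plain text"; a general XPath query would need something like XPathDocument, which accepts a stream but still builds an in-memory tree. The names StreamingSearch, ReadElementValues and SearchUrlAsync are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml;

static class StreamingSearch
{
    // Forward-only scan over the stream. Collects the text of every element
    // named elementName. Assumes those elements contain text only (no child
    // elements); ReadElementContentAsString throws otherwise.
    public static List<string> ReadElementValues(Stream xml, string elementName)
    {
        var results = new List<string>();
        using (var reader = XmlReader.Create(xml))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // Reads the text and moves past the end tag, so we must
                    // not call Read() again before re-checking the node.
                    results.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return results;
    }

    // Parses the response body as it arrives instead of buffering it into a string.
    public static async Task<List<string>> SearchUrlAsync(
        HttpClient client, string url, string elementName)
    {
        using (Stream body = await client.GetStreamAsync(url))
        {
            return ReadElementValues(body, elementName);
        }
    }
}
```

Note that the parse inside SearchUrlAsync is synchronous, so it ties up a thread for as long as the download lasts.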
I think you're unnecessarily worried about the large object heap. If you're downloading and processing one file at a time, each string will go into the LOH, get processed, and then be collected. Yes, there's the potential of fragmenting your large object heap, but if the files are all in the 8 to 10 MB range, it's highly unlikely in practice that you will have a problem. There would have to be a pathological arrangement of files.
And you don't really have to download to a string. You can pre-allocate a buffer of, say, 20 MB, and download to that buffer. Then wrap a MemoryStream around it, create an XmlReader on that memory stream, etc. So your LOH won't get fragmented at all because you just re-use that 20 MB buffer. I really wouldn't go this route unless I absolutely had to, though.
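If you do end up needing it, the reusable-buffer idea might look roughly like this. It is only a sketch: the class name ReusableBufferSearcher and its methods are hypothetical, it handles one file at a time (it is not thread-safe), and it assumes no file exceeds the 20 MB buffer. XPathDocument still allocates its own node tree, but that consists of many small objects rather than one large string.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

sealed class ReusableBufferSearcher
{
    // Allocated once (on the LOH) and reused for every file.
    private readonly byte[] _buffer = new byte[20 * 1024 * 1024];

    // Copies the whole source stream into the shared buffer.
    // Returns the number of bytes copied.
    private int Fill(Stream source)
    {
        int total = 0;
        int read;
        while (total < _buffer.Length &&
               (read = source.Read(_buffer, total, _buffer.Length - total)) > 0)
        {
            total += read;
        }
        // Buffer is full: make sure nothing is left unread.
        if (total == _buffer.Length && source.ReadByte() != -1)
        {
            throw new InvalidDataException("File is larger than the pre-allocated buffer.");
        }
        return total;
    }

    // Loads the stream into the buffer and returns the value of the first
    // node matching xpath, or null if nothing matches.
    public string SelectFirst(Stream source, string xpath)
    {
        int length = Fill(source);
        // Expose only the bytes that belong to this file, read-only.
        using (var memory = new MemoryStream(_buffer, 0, length, writable: false))
        using (var reader = XmlReader.Create(memory))
        {
            var document = new XPathDocument(reader);
            XPathNavigator node = document.CreateNavigator().SelectSingleNode(xpath);
            return node?.Value;
        }
    }
}
```

The source stream would typically be the HTTP response stream; limiting the MemoryStream to the first length bytes keeps leftovers from a previous, longer file out of the parse.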
Were I assigned this task, I'd do it in the simplest way possible. The limiting factor is going to be the download speed, so that's where I'd concentrate any optimization efforts. I wouldn't worry at all about potential LOH fragmentation, but keep the alternate solution in my back pocket just in case that crops up as a problem.
How you approach this really depends on how fast that XPath search is. If it takes milliseconds or even a few seconds to search a 10 MB XML file, then it makes no sense at all to worry about optimizing the search: the download time is going to dwarf the search time. Instead, I'd see if I could get two or four concurrent downloads, throw each string result into a BlockingCollection when it comes in, and have a consumer thread reading that queue and running the search. That consumer thread will probably spend a lot of its time idle, waiting for the next file to come down.
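A minimal sketch of that producer/consumer arrangement follows. The names DownloadPipeline and RunAsync are invented for this example, and the download and search steps are passed in as delegates so the sketch makes no assumptions about how you fetch or query the files. Results come back in completion order, not in the order of the URLs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class DownloadPipeline
{
    public static async Task<List<TResult>> RunAsync<TResult>(
        IEnumerable<string> urls,
        Func<string, Task<string>> download,   // e.g. an HttpClient call
        Func<string, TResult> search,          // e.g. the XPath lookup
        int maxConcurrentDownloads = 4)
    {
        var results = new List<TResult>();
        using (var queue = new BlockingCollection<string>())
        using (var gate = new SemaphoreSlim(maxConcurrentDownloads))
        {
            // Single consumer: searches each document as it arrives and
            // blocks (idle) while the queue is empty.
            Task consumer = Task.Run(() =>
            {
                foreach (string xml in queue.GetConsumingEnumerable())
                {
                    results.Add(search(xml));
                }
            });

            // Producers: the semaphore lets only a few downloads run at once.
            Task[] downloads = urls.Select(async url =>
            {
                await gate.WaitAsync();
                try
                {
                    queue.Add(await download(url));
                }
                finally
                {
                    gate.Release();
                }
            }).ToArray();

            try
            {
                await Task.WhenAll(downloads);
            }
            finally
            {
                // Always let the consumer finish, even if a download failed.
                queue.CompleteAdding();
            }
            await consumer;
        }
        return results;
    }
}
```

With an HttpClient instance named client, the download delegate could be as simple as url => client.GetStringAsync(url). Start with maxConcurrentDownloads at two to four and measure before raising it.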
In short: make it work, then make it work fast.