
Implement IVirtualListDataSource when object count is unknown


I'm implementing a log file viewer with ObjectListView; to be precise, my class of choice is VirtualObjectListView.

In the constructor I assign an implementation of the IVirtualListDataSource interface to the VirtualListDataSource property:

public LogWindow(List<String> logFiles)
{
    InitializeComponent();

    // LogSource implements IVirtualListDataSource
    OLV_Log.VirtualListDataSource = new LogSource(logFiles);
}

The files I'm processing vary from a few lines to millions of lines, so I thought a virtual list was the way to go. My problem is that I don't know the number of lines until I have read the whole file, which takes a long time for big files.

Each line is taken from the log files using a yield statement:

internal class LogSource : IVirtualListDataSource
{
    // ...

    public class LogLine { /* whatever */ }

    // ...

    private IEnumerable<LogLine> Read()
    {
        foreach (var path in m_logFiles)
        {
            using var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, FileOptions.Asynchronous | FileOptions.SequentialScan);
            using var streamReader = new StreamReader(fileStream);

            for (string? line = String.Empty; line != null; line = streamReader.ReadLine())
            {
                if (!String.IsNullOrEmpty(line))
                {
                    // process text...

                    var logLine = new LogLine(/* whatever */);

                    // do things...

                    yield return logLine;
                }
            }

        }
    }

    // ...

}

The lines are added to a "cache" on demand:

internal class LogSource : IVirtualListDataSource
{
    // ...

    public class LogLine { /* whatever */ }

    private readonly List<LogLine> m_logLines = new();

    // ...

    public object GetNthObject(int index)
    {
        int offset = index - m_logLines.Count + 1;

        if (offset > 0)
            m_logLines.AddRange(Read().Take(offset));

        return m_logLines[index];
    }

    // ...

    public void PrepareCache(int first, int last)
    {
        GetNthObject(last);
    }

    // ...
}

Since I don't know beforehand how many lines exist, I don't know what to return from LogSource.GetObjectCount(). Here is what I've tried so far:

  1. Return an arbitrary number.
    • Returning a small number (say 500) works as long as the log files contain at least that many lines. Any line count below it causes an (expected) exception at the return m_logLines[index]; instruction, while any line count above it truncates the result.
    • Returning int.MaxValue behaves as if there were no lines at all (weird!).
  2. Return a guess based on the size of the files: if I assume an average of 75 characters per line, 750 bytes of log files would be roughly 10 lines.
    • Same problems as previous point.
  3. Update line number dynamically.
    • If I return m_logLines.Count from GetObjectCount, my VirtualObjectListView is never filled: the object count is queried before any element is added to m_logLines, so it is 0 and neither GetNthObject nor PrepareCache is ever called.

So, how should I use a VirtualObjectListView so that it updates the line count dynamically? What should I return from GetObjectCount when I don't know the object count?

Also, any improvement to my code is more than welcome.


Solution

  • [Update]

    I have created Gigantor which is a better and more general solution to the problem of counting lines in very large files. It also includes efficient regular expression searches for very large files. It works by partitioning the file into chunks which are processed in parallel by a pool of worker threads and ultimately consolidated into a single continuous result. On my test machine I got rates up to about 3.4 GBytes/s.

    [Original Answer]

    I found this ObjectListView but couldn't easily find the definition of IVirtualListDataSource and was too lazy to search hard. So some of my answer is how I think that interface should work based on experience (i.e. hubris).

    I'll get to your main question in a minute, but first, I think PrepareCache and GetNthObject are behaving badly. Each call to GetNthObject reads log lines 0 to N, stores them all in memory in m_logLines, and then returns only the one line that is needed, every time the view cache changes. This approach will be slow and will run out of memory for large amounts of log data (which I assume you have).

    I think you want PrepareCache to fetch the log lines between first and last from the log files and keep only those lines in memory. Calls to GetNthObject should then return lines already cached in memory by a prior call to PrepareCache.
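
    One detail worth illustrating here: every call to an iterator method such as Read() starts a fresh enumeration, so Read().Take(offset) hands back the first offset lines again rather than the next ones. The sketch below is only an illustration, with an in-memory sequence standing in for the log files; LazyLineCache and its members are hypothetical names, not part of ObjectListView. It keeps a single enumerator alive so that reading resumes where it stopped.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical helper: caches items from a lazy sequence and resumes
    // reading where the previous call stopped.
    public sealed class LazyLineCache : IDisposable
    {
        // One enumerator, kept alive between calls (never restarted).
        private readonly IEnumerator<string> m_source;
        private readonly List<string> m_lines = new List<string>();

        public LazyLineCache(IEnumerable<string> source)
        {
            m_source = source.GetEnumerator();
        }

        public int CachedCount => m_lines.Count;

        // Returns the line at index, or null if the source ends before it.
        public string GetNth(int index)
        {
            while (m_lines.Count <= index && m_source.MoveNext())
                m_lines.Add(m_source.Current);

            return index >= 0 && index < m_lines.Count ? m_lines[index] : null;
        }

        public void Dispose() => m_source.Dispose();
    }
    ```

    This still keeps every line read so far in memory, so it only removes the repeated re-reading; the memory concern described above remains.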

    Here are some tweaks I made to your LogSource class to facilitate the rest of the discussion.

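    // Requires: using System.Collections.Concurrent; using System.Collections.Generic; using System.Threading;
    class LogSource : IVirtualListDataSource {
        public class LogLine {
            public LogLine(string text) {}
        }
        internal int m_objectCount;
        internal List<string> m_logFiles;
        internal List<LogLine> m_cache = new();   // empty until PrepareCache runs
        internal struct LineData {
            public string Path;
            public int StartLine;   // first virtual index covered by this file
            public int EndLine;     // one past the last virtual index (exclusive)
        }
        internal ConcurrentBag<LineData> m_lineData = new();
        internal int m_cacheStartLine;
        internal int m_lastIndex = -1;   // -1: no line has been read yet
        internal long m_lastFpos;
        internal Thread m_initThread;
    
        // public so that the form can construct it
        public LogSource(List<string> logFiles)
        {
            m_logFiles = logFiles;
        }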
    

    We need something that can gradually build up the knowledge about which file/line a virtual index references in the background. As this knowledge is built in the background the user should be able to gradually access more and more log data. This Initialize function can do that when called in the background (see InitializeInBackground later in this post). The idea is to create an index of all the files that easily fits into memory. We do not try to store the log data itself because it won't fit. This index could be improved and optimized by tracking more positions in the file, but I chose to keep it pretty simple and just track the start and end of each file.

        // Map lines from all log files to index,
        // This can take a while depending on the amount of log data,
        // intended to be called from InitializeInBackground (not directly)
        private void Initialize(VirtualObjectListView view)
        {
            m_lineData = new();
            m_objectCount = 0;
            foreach (var path in m_logFiles) {
                var endLine = m_objectCount;
                using var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, FileOptions.Asynchronous | FileOptions.SequentialScan);
                using var streamReader = new StreamReader(fileStream);
                while (streamReader.ReadLine() != null) {
                    endLine++;
                }
                m_lineData.Add(new LineData() { Path = path, StartLine = m_objectCount, EndLine = endLine });
                m_objectCount = endLine;
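                // Update virtual list size on the UI thread: controls must not be
                // touched from a background thread (assumes the view's handle exists)
                view.BeginInvoke(new Action(view.UpdateVirtualListSize));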
                // Give up the rest of our time slice
                Thread.Sleep(0);
            }
        }
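
    The index built by Initialize amounts to a list of half-open line ranges, one per file. As a rough sketch of the lookup only, the code below maps a virtual index to a file and a line within that file. FileRange and LineIndex are hypothetical names, and line counts are passed in directly instead of being read from disk.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical: one entry per file, covering virtual lines [StartLine, EndLine).
    public readonly struct FileRange
    {
        public FileRange(string path, int startLine, int endLine)
        {
            Path = path; StartLine = startLine; EndLine = endLine;
        }
        public string Path { get; }
        public int StartLine { get; }
        public int EndLine { get; }   // exclusive
    }

    public sealed class LineIndex
    {
        // Ranges are appended in order, so the list stays sorted by StartLine.
        private readonly List<FileRange> m_ranges = new List<FileRange>();

        public int TotalLines { get; private set; }

        // Appends a file with the given number of lines to the end of the index.
        public void AddFile(string path, int lineCount)
        {
            m_ranges.Add(new FileRange(path, TotalLines, TotalLines + lineCount));
            TotalLines += lineCount;
        }

        // Binary search over the ranges; returns false if index is not mapped (yet).
        public bool TryLocate(int index, out string path, out int lineInFile)
        {
            int lo = 0, hi = m_ranges.Count - 1;
            while (lo <= hi)
            {
                int mid = lo + (hi - lo) / 2;
                FileRange r = m_ranges[mid];
                if (index < r.StartLine) hi = mid - 1;
                else if (index >= r.EndLine) lo = mid + 1;
                else { path = r.Path; lineInFile = index - r.StartLine; return true; }
            }
            path = null; lineInFile = -1;
            return false;
        }
    }
    ```

    The sketch is not thread-safe as written; if the index is filled on a background thread while the UI thread reads it, some form of locking would be needed.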
    

    Now to your main question. Notice how the prior code block calls VirtualObjectListView.UpdateVirtualListSize each time the line count changes, once per file. This calls the GetObjectCount method of your virtual data source (shown below), which simply returns the current size, and this is why the Initialize method has a dependency on the VirtualObjectListView.

        public int GetObjectCount()
        {
            return m_objectCount;
        }
    

    The function below is a helper called by PrepareCache to map a virtual index to a log line. It returns the log line if Initialize has progressed far enough, and null until it has.

        // Return the line mapped to the virtual index or null if index is out of range
        internal string ReadLine(int index)
        {
            string text = null;
            foreach (var lineData in m_lineData) {
                // EndLine is exclusive: it equals the StartLine of the next file
                if (index >= lineData.StartLine && index < lineData.EndLine) {
                    using var fileStream = new FileStream(lineData.Path, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, FileOptions.Asynchronous | FileOptions.SequentialScan);
                    using var streamReader = new StreamReader(fileStream);
                    if (index - m_lastIndex == 1 &&
                        m_lastIndex >= lineData.StartLine &&
                        m_lastIndex < lineData.EndLine) {
                        // continuation read, continue where we left off
                        fileStream.Position = m_lastFpos;
                        text = streamReader.ReadLine();
                    }
                    else {
                        // not a continuation read, find the line
                        var line = index - lineData.StartLine;
                        do {
                            text = streamReader.ReadLine();
                        } while (line-- > 0);
                    }
                    // Caution: StreamReader reads ahead in blocks, so fileStream.Position
                    // is usually past the end of the line just returned; a reliable
                    // continuation read needs the real byte offset of the next line
                    m_lastFpos = fileStream.Position;
                    m_lastIndex = index;
                    return text;
                }
            }
            return text;
        }
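
    As mentioned earlier, the index could be improved by tracking more positions in each file. One way this might look, purely as a sketch: scan the raw bytes once, remember the byte offset at which every stride-th line starts, and later seek to the nearest checkpoint instead of re-reading from the top. SparseLineOffsets is a hypothetical name, and the sketch assumes LF or CRLF line endings in an encoding where the byte 0x0A only ever means a newline (ASCII or UTF-8).

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    // Hypothetical: remembers the byte offset at which every 'stride'-th line starts.
    public sealed class SparseLineOffsets
    {
        private readonly List<long> m_checkpoints = new List<long>();
        private readonly int m_stride;

        public int LineCount { get; private set; }

        public SparseLineOffsets(Stream stream, int stride)
        {
            if (stride <= 0) throw new ArgumentOutOfRangeException(nameof(stride));
            m_stride = stride;
            var buffer = new byte[64 * 1024];
            long position = 0;        // byte offset of buffer[0] within the stream
            long lineStart = 0;       // byte offset where the current line starts
            bool lineHasData = false; // true while inside an unterminated line
            int read;
            stream.Position = 0;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    lineHasData = true;
                    if (buffer[i] != (byte)'\n') continue;
                    Record(lineStart);
                    lineStart = position + i + 1;
                    lineHasData = false;
                }
                position += read;
            }
            if (lineHasData) Record(lineStart); // final line without a trailing newline
        }

        private void Record(long lineStart)
        {
            if (LineCount % m_stride == 0) m_checkpoints.Add(lineStart);
            LineCount++;
        }

        // Seeks to the nearest checkpoint at or before 'line', then skips forward.
        public string ReadLine(Stream stream, int line)
        {
            if (line < 0 || line >= LineCount) return null;
            stream.Position = m_checkpoints[line / m_stride];
            using var reader = new StreamReader(stream, Encoding.UTF8, false, 4096, leaveOpen: true);
            string text = null;
            for (int skip = line % m_stride; skip >= 0; skip--)
                text = reader.ReadLine();
            return text;
        }
    }
    ```

    With a stride of, say, 1000, the index costs one long per thousand lines, and a random access reads at most 1000 lines past the checkpoint. The stride is a tuning knob, not a recommendation.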
    

    Get objects from the cache.

        // Return the LogLine mapped to the virtual index or null if out of range
        public object GetNthObject(int index)
        {
            var cacheIndex = index - m_cacheStartLine;
            if (cacheIndex >=0 && cacheIndex < m_cache.Count) {
                return m_cache[cacheIndex];
            }
            return null;
        }
    

    Prepare the cache

        // Prepare the cache to map to the requested range
        public void PrepareCache(int first, int last)
        {
            m_cacheStartLine = first;
            m_cache = new();  // naively just destroy everything and start over
            for (var i=first; i<=last; i++) {
                var text = ReadLine(i);
                if (text == null) {
                    break;
                }
                else {
                    m_cache.Add(new LogLine(text));
                }
            }
        }
    }
    

    Below is an example of how to run Initialize on a background thread so the application remains responsive while the log files are being processed. This method belongs inside LogSource as well.

        public void InitializeInBackground(VirtualObjectListView view)
        {
            m_initThread = new Thread(new ThreadStart(() => Initialize(view)));
            m_initThread.IsBackground = true;
            m_initThread.Start();
        }
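
    A possible alternative to a raw Thread is Task.Run combined with IProgress<int>. A Progress<int> created on the UI thread runs its callback through the captured synchronization context, so the list view would only be touched from the UI thread. The sketch below is an assumption-laden illustration rather than tested application code: LineCounter is a hypothetical name, and the sources are factories for TextReader so the example does not depend on real files.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical helper: counts lines on a thread-pool thread and reports
    // the running total after each source has been read completely.
    public static class LineCounter
    {
        public static Task<int> CountAsync(
            IEnumerable<Func<TextReader>> sources,
            IProgress<int> progress,
            CancellationToken token = default)
        {
            return Task.Run(() =>
            {
                int total = 0;
                foreach (var open in sources)
                {
                    token.ThrowIfCancellationRequested();
                    using (TextReader reader = open())
                    {
                        while (reader.ReadLine() != null)
                            total++;
                    }
                    progress?.Report(total); // one report per completed source
                }
                return total;
            }, token);
        }
    }
    ```

    In a form, the progress callback could store the reported total where GetObjectCount can read it and then call UpdateVirtualListSize on the list view. For log files, each factory would open a StreamReader on one path.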
    

    The code in this post has not been tested.