I'm implementing a log file viewer with ObjectListView; to be precise, my class of choice is VirtualObjectListView.
In the constructor I assign an implementation of the IVirtualListDataSource interface to the VirtualListDataSource property:
public LogWindow(List<String> logFiles)
{
    InitializeComponent();
    // LogSource implements IVirtualListDataSource
    OLV_Log.VirtualListDataSource = new LogSource(logFiles);
}
The files I'm processing vary from a few lines to millions of lines, so I thought that using a virtual list was the way to go. My problem is that I don't know the number of lines until I have fully read the file, which takes a long time for big files.
Each line is taken from the log files using a yield statement:
internal class LogSource : IVirtualListDataSource
{
    // ...
    public class LogLine { /* whatever */ }
    // ...
    private IEnumerable<LogLine> Read()
    {
        foreach (var path in m_logFiles)
        {
            using var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, FileOptions.Asynchronous | FileOptions.SequentialScan);
            using var streamReader = new StreamReader(fileStream);
            for (string? line = String.Empty; line != null; line = streamReader.ReadLine())
            {
                if (!String.IsNullOrEmpty(line))
                {
                    // process text...
                    var logLine = new LogLine(/* whatever */);
                    // do things...
                    yield return logLine;
                }
            }
        }
        yield break;
    }
    // ...
}
And added to "cache" on demand:
internal class LogSource : IVirtualListDataSource
{
    // ...
    public class LogLine { /* whatever */ }
    private readonly List<LogLine> m_logLines = new();
    // ...
    public object GetNthObject(int index)
    {
        int offset = index - m_logLines.Count + 1;
        if (offset > 0)
            m_logLines.AddRange(Read().Take(offset));
        return m_logLines[index];
    }
    // ...
    public void PrepareCache(int first, int last)
    {
        GetNthObject(last);
    }
    // ...
}
So, as I don't know beforehand how many lines exist, I don't know what to return from LogSource.GetObjectCount(). Here is what I've tried so far:
- […] the return m_logLines[index]; instruction, while any line count above truncates the result.
- int.MaxValue behaves as if there were no lines at all (weird!).
- Returning m_logLines.Count from GetObjectCount: my VirtualObjectListView is not filled, since the object count is queried before any element is added to m_logLines, so it is 0 and there is no call to GetNthObject or PrepareCache.
So, how should I use a VirtualObjectListView so that it updates the line count dynamically? What should I return from GetObjectCount when I don't know the object count?
Also, any improvement to my code is more than welcome.
[Update]
I have created Gigantor which is a better and more general solution to the problem of counting lines in very large files. It also includes efficient regular expression searches for very large files. It works by partitioning the file into chunks which are processed in parallel by a pool of worker threads and ultimately consolidated into a single continuous result. On my test machine I got rates up to about 3.4 GBytes/s.
[Original Answer]
I found this ObjectListView but couldn't easily find the definition for IVirtualListDataSource and was too lazy to search hard. So some of my answer is how I think that interface should work based on experience (i.e. hubris).
I'll get to your main question in a minute, but first, I think PrepareCache and GetNthObject are behaving badly. Calls to GetNthObject read log lines 0 to N, store them all in memory in m_logLines, and then throw away almost everything, selecting only the one line that is needed each time the view cache changes. This approach will be slow and will run out of memory for large amounts of log data (which I assume you have).
I think you want PrepareCache to go grab the log lines specified by first and last from the log files and store just those lines in memory. Calls to GetNthObject should then return lines already cached in memory by a prior call to PrepareCache.
Here are some tweaks I made to your LogSource class to facilitate the rest of the discussion.
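using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using BrightIdeasSoftware; // ObjectListView namespace

class LogSource : IVirtualListDataSource {
    public class LogLine {
        public LogLine(string text) {}
    }
    internal int m_objectCount;             // number of lines indexed so far
    internal List<string> m_logFiles;
    internal List<LogLine> m_cache = new(); // lines currently held for the view
    internal struct LineData {
        public string Path;
        public int StartLine; // virtual index of the file's first line
        public int EndLine;   // virtual index one past the file's last line
    }
    internal ConcurrentBag<LineData> m_lineData = new();
    internal int m_cacheStartLine;          // virtual index of m_cache[0]
    internal int m_lastIndex;
    internal long m_lastFpos;
    internal Thread m_initThread;

    // public so that LogWindow can construct it
    public LogSource(List<string> logFiles)
    {
        m_logFiles = logFiles;
    }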
We need something that can gradually build up, in the background, the knowledge of which file and line a virtual index refers to. As this knowledge is built, the user should be able to access more and more of the log data. The Initialize function below does that when run in the background (see InitializeInBackground later in this post). The idea is to create an index of all the files that easily fits into memory. We do not try to store the log data itself because it won't fit. This index could be improved and optimized by tracking more positions in the file, but I chose to keep it simple and track only the start and end of each file.
    // Map lines from all log files to index.
    // This can take a while depending on the amount of log data;
    // intended to be called from InitializeInBackground (not directly).
    private void Initialize(VirtualObjectListView view)
    {
        m_lineData = new();
        m_objectCount = 0;
        foreach (var path in m_logFiles) {
            var endLine = m_objectCount;
            using var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, FileOptions.Asynchronous | FileOptions.SequentialScan);
            using var streamReader = new StreamReader(fileStream);
            while (streamReader.ReadLine() != null) {
                endLine++;
            }
            m_lineData.Add(new LineData() { Path = path, StartLine = m_objectCount, EndLine = endLine });
            m_objectCount = endLine;
            // Update virtual list size. This method runs on a background thread,
            // so marshal the call to the UI thread (the control's handle must exist).
            view.BeginInvoke(new Action(view.UpdateVirtualListSize));
            // Give up the rest of our time slice
            Thread.Sleep(0);
        }
    }
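The index built by Initialize tracks only the start and end of each file, and the text above notes that it could track more positions. Here is a minimal sketch of that idea, not a tested drop-in: SparseLineIndex is a hypothetical helper (it is not part of ObjectListView) that remembers the byte offset of every stride-th line, so a lookup seeks to the nearest checkpoint and skips at most stride - 1 lines. It assumes UTF-8 (or ASCII) files whose lines end in \n or \r\n, and files that do not change after indexing.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: sparse index of line start offsets for one file.
public sealed class SparseLineIndex
{
    private readonly string _path;
    private readonly int _stride;
    // _checkpoints[k] is the byte offset where line k * _stride starts.
    private readonly List<long> _checkpoints = new List<long>();

    public int LineCount { get; private set; }

    public SparseLineIndex(string path, int stride = 1024)
    {
        if (stride < 1) throw new ArgumentOutOfRangeException(nameof(stride));
        _path = path;
        _stride = stride;
        Build();
    }

    // One sequential pass over the raw bytes, counting '\n' terminators.
    private void Build()
    {
        using var stream = new FileStream(_path, FileMode.Open, FileAccess.Read,
            FileShare.Read, 1 << 16, FileOptions.SequentialScan);
        var buffer = new byte[1 << 16];
        long position = 0;  // absolute offset of buffer[0]
        long lineStart = 0; // absolute offset where the current line begins
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] != (byte)'\n') continue;
                Record(lineStart);
                lineStart = position + i + 1;
            }
            position += read;
        }
        // A final line without a trailing '\n' still counts as a line.
        if (lineStart < position) Record(lineStart);
    }

    private void Record(long lineStart)
    {
        if (LineCount % _stride == 0) _checkpoints.Add(lineStart);
        LineCount++;
    }

    // Seek to the checkpoint at or before the line, then skip forward to it.
    public string? ReadLine(int lineNumber)
    {
        if (lineNumber < 0 || lineNumber >= LineCount) return null;
        using var stream = new FileStream(_path, FileMode.Open, FileAccess.Read, FileShare.Read);
        stream.Position = _checkpoints[lineNumber / _stride];
        using var reader = new StreamReader(stream, Encoding.UTF8, false);
        string? text = null;
        for (int skip = lineNumber % _stride; skip >= 0; skip--)
        {
            text = reader.ReadLine();
        }
        return text;
    }
}
```

The offsets are counted from the raw bytes rather than taken from FileStream.Position, because a StreamReader reads ahead in blocks and the stream position does not correspond to the end of the last line it returned. A smaller stride makes lookups faster at the cost of a larger index.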
Now to your main question. Notice how the line near the end of the prior code block calls VirtualObjectListView.UpdateVirtualListSize each time the size is updated. This calls the GetObjectCount method of your virtual data source (shown below), which simply returns the current size, and this is why the Initialize method has a dependency on the VirtualObjectListView.
    public int GetObjectCount()
    {
        return m_objectCount;
    }
The function below is a helper called by PrepareCache to map index to a log line. It returns the log line if Initialize has progressed far enough, or null until it has.
    // Return the line mapped to the virtual index or null if index is out of range
    internal string ReadLine(int index)
    {
        foreach (var lineData in m_lineData) {
            // EndLine is exclusive: it equals the StartLine of the next file
            if (index >= lineData.StartLine && index < lineData.EndLine) {
                using var fileStream = new FileStream(lineData.Path, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, FileOptions.Asynchronous | FileOptions.SequentialScan);
                using var streamReader = new StreamReader(fileStream);
                // StreamReader reads ahead in blocks, so fileStream.Position does not
                // mark the end of the last line returned and cannot be used to resume
                // a previous read. Scan from the start of the file instead; this is
                // simple but slow for lines deep inside a large file.
                string text = null;
                var line = index - lineData.StartLine;
                do {
                    text = streamReader.ReadLine();
                } while (line-- > 0);
                return text;
            }
        }
        return null;
    }
Get objects from the cache.
    // Return the LogLine mapped to the virtual index or null if out of range
    public object GetNthObject(int index)
    {
        var cacheIndex = index - m_cacheStartLine;
        // m_cache may still be null if PrepareCache has not been called yet
        if (m_cache != null && cacheIndex >= 0 && cacheIndex < m_cache.Count) {
            return m_cache[cacheIndex];
        }
        return null;
    }
Prepare the cache.
    // Prepare the cache to map to the requested range
    public void PrepareCache(int first, int last)
    {
        m_cacheStartLine = first;
        m_cache = new(); // naively just destroy everything and start over
        for (var i = first; i <= last; i++) {
            var text = ReadLine(i);
            if (text == null) {
                break;
            }
            else {
                m_cache.Add(new LogLine(text));
            }
        }
    }
}
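PrepareCache above rebuilds the cache from scratch on every call. When the view scrolls by a few rows, most of the requested range is already in memory, so those items can be carried over. The following is a minimal sketch of that idea under stated assumptions: WindowCache is a hypothetical class, and its load delegate stands in for whatever reads one item by virtual index (for example, a wrapper around ReadLine) and returns null when the index is not available yet.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical sliding-window cache over a virtual index range.
public sealed class WindowCache<T> where T : class
{
    private readonly Func<int, T?> _load;
    private List<T> _items = new List<T>();
    private int _first; // virtual index of _items[0]

    public WindowCache(Func<int, T?> load) => _load = load;

    // Make the cache cover [first, last], reusing items that are already held.
    public void Prepare(int first, int last)
    {
        var next = new List<T>(Math.Max(0, last - first + 1));
        for (int i = first; i <= last; i++)
        {
            // Get still sees the previous window here, so overlapping items are reused.
            T? item = Get(i) ?? _load(i);
            if (item == null) break; // the source has not reached this index yet
            next.Add(item);
        }
        _items = next;
        _first = first;
    }

    // Return the cached item for a virtual index, or null if it is outside the window.
    public T? Get(int index)
    {
        int offset = index - _first;
        return offset >= 0 && offset < _items.Count ? _items[offset] : null;
    }
}
```

With this shape, PrepareCache would forward to Prepare and GetNthObject to Get. Only the rows that scrolled into view are read from disk; everything else in the new window is copied by reference from the old one.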
Below is an example of how to run Initialize on a background thread so the application remains responsive while the log files are being processed. This method belongs inside LogSource, next to Initialize.
    public void InitializeInBackground(VirtualObjectListView view)
    {
        m_initThread = new Thread(new ThreadStart(() => Initialize(view)));
        m_initThread.IsBackground = true;
        m_initThread.Start();
    }
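A raw Thread works, but it offers no built-in way to stop the scan or to report progress. As a possible alternative, here is a minimal sketch using Task.Run with a CancellationToken and IProgress<int>. LineCounter and CountAsync are hypothetical names introduced only for illustration; the sketch counts lines and reports the running total after each file, which is the same information Initialize publishes through m_objectCount.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: counts lines on a thread-pool thread.
public static class LineCounter
{
    // Reports the running total after each file and returns the final total.
    public static Task<int> CountAsync(
        IReadOnlyList<string> paths,
        IProgress<int>? progress = null,
        CancellationToken token = default)
    {
        return Task.Run(() =>
        {
            int total = 0;
            foreach (var path in paths)
            {
                using var reader = new StreamReader(path);
                while (reader.ReadLine() != null)
                {
                    total++;
                    // Poll every 4096 lines so a long scan can be abandoned,
                    // for example when the log window is closed.
                    if ((total & 0xFFF) == 0) token.ThrowIfCancellationRequested();
                }
                progress?.Report(total);
            }
            return total;
        }, token);
    }
}
```

If the Progress<int> instance passed in is created on the UI thread, its callback is posted back to that thread, so it would be a reasonable place to store the new count and call UpdateVirtualListSize without any explicit marshalling.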
The code in this post has not been tested.