c# stream filestream memorystream line-count

Get Estimate of Line Count in a text file


I would like to estimate the number of lines in a CSV/text file so that I can use that number for a progress bar. The file could be extremely large, so getting the exact line count would take too long for this purpose.

What I have come up with is below: read a portion of the file, count the lines in it, and use the file size to extrapolate the total number of lines:

    public static int GetLineCountEstimate(string file)
    {
        double count = 0;
        using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
        {
            long byteCount = fs.Length;
            const int maxByteCount = 524288; // sample at most 512 KB
            if (byteCount > maxByteCount)
            {
                var buf = new byte[maxByteCount];
                // Read may return fewer bytes than requested, so use its return value.
                int read = fs.Read(buf, 0, maxByteCount);
                string s = System.Text.Encoding.UTF8.GetString(buf, 0, read);
                // Extrapolate from the sample in floating point to avoid integer truncation.
                count = s.Split('\n').Length * (double)byteCount / read;
            }
            else
            {
                var buf = new byte[byteCount];
                int read = fs.Read(buf, 0, (int)byteCount);
                string s = System.Text.Encoding.UTF8.GetString(buf, 0, read);
                count = s.Split('\n').Length;
            }
        }
        return Convert.ToInt32(count);
    }

This seems to work OK, but I have some concerns:

1) I would like the parameter to be a Stream (rather than a filename), since I may also be reading from the clipboard (MemoryStream). However, Stream doesn't seem to be able to read n bytes at once into a buffer or report the total length of the stream in bytes, like FileStream can. Stream is the parent class of both MemoryStream and FileStream.

2) I don't want to assume an encoding such as UTF-8.

3) I don't want to assume an end-of-line character (it should work for CR, CRLF, and LF).

I would appreciate any help to make this function more robust.


Solution

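  • Here is what I came up with as a more robust solution for estimating line count. It accepts any seekable Stream (so both FileStream and MemoryStream work), and StreamReader.ReadLine recognizes CR, LF, and CRLF line endings.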

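    // Requires: using System; using System.IO; using System.Text;
    public static int EstimateLineCount(string file)
    {
        using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
        {
            return EstimateLineCount(fs);
        }
    }

    public static int EstimateLineCount(Stream s)
    {
        // If the stream is larger than 10 MB, estimate the line count; otherwise count exactly.
        const int maxBytes = 10485760; // 10 MB = 1024 * 1024 * 10 bytes

        s.Position = 0; // requires a seekable stream (s.CanSeek)
        // The encoding is detected from a byte order mark if present, falling back to UTF-8.
        // leaveOpen: true (.NET 4.5 or later) keeps the caller's stream usable afterwards.
        using (var sr = new StreamReader(s, Encoding.UTF8, true, 1024, leaveOpen: true))
        {
            int lineCount = 0;
            if (s.Length > maxBytes)
            {
                // s.Position advances in buffer-sized steps, which is close enough for an estimate.
                while (s.Position < maxBytes && sr.ReadLine() != null)
                    lineCount++;

                // Scale the sampled count by the fraction of the stream that was read.
                return Convert.ToInt32((double)lineCount * s.Length / s.Position);
            }

            while (sr.ReadLine() != null)
                lineCount++;
            return lineCount;
        }
    }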
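
The code above still decodes the text, falling back to UTF-8 when there is no byte order mark. As a rough sketch of another way to approach concerns 2 and 3 from the question, the sample can be scanned as raw bytes, counting CR, LF, and CRLF terminators without decoding at all. `LineEstimator` and the `sampleBytes` parameter are hypothetical names, not part of the original answer. The sketch assumes a seekable stream and an ASCII-compatible encoding (UTF-8, Windows-1252, and similar); UTF-16 text would still need a decoder.

```csharp
using System;
using System.IO;

public static class LineEstimator
{
    // Hypothetical helper: counts line breaks in the first sampleBytes bytes
    // of a seekable stream and extrapolates to the full length.
    public static long EstimateLineCount(Stream s, int sampleBytes = 1 << 20)
    {
        if (!s.CanSeek) throw new NotSupportedException("A seekable stream is required.");
        if (sampleBytes <= 0) throw new ArgumentOutOfRangeException(nameof(sampleBytes));

        long start = s.Position; // remember where the caller was
        try
        {
            s.Position = 0;
            long length = s.Length;
            if (length == 0) return 0;

            var buf = new byte[(int)Math.Min(sampleBytes, length)];
            int read = 0, n;
            // Stream.Read may return fewer bytes than requested, so loop.
            while (read < buf.Length && (n = s.Read(buf, read, buf.Length - read)) > 0)
                read += n;
            if (read == 0) return 0;

            long breaks = 0;
            for (int i = 0; i < read; i++)
            {
                if (buf[i] == (byte)'\n')
                    breaks++; // LF, or the LF half of a CRLF pair
                else if (buf[i] == (byte)'\r' && (i + 1 >= read || buf[i + 1] != (byte)'\n'))
                    breaks++; // lone CR
            }

            if (read >= length)
            {
                // Whole stream scanned: the count is exact, and a final line
                // without a terminator still counts as a line.
                byte last = buf[read - 1];
                bool endsWithBreak = last == (byte)'\n' || last == (byte)'\r';
                return endsWithBreak ? breaks : breaks + 1;
            }

            // Only a sample was scanned: scale by the fraction of bytes read.
            return (long)Math.Round((double)breaks * length / read);
        }
        finally
        {
            s.Position = start; // leave the stream where the caller had it
        }
    }
}
```

Calling it on a MemoryStream built from clipboard text works the same way as on a FileStream, e.g. `long lines = LineEstimator.EstimateLineCount(stream);`. The estimate is only as good as the sample: if line lengths at the start of the file differ a lot from the rest, the progress bar will drift.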