c#, multithreading, optimization, text-files, streamreader

How to optimize the counting of words and characters in a huge file using multithreading?


I have a very large text file around 1 GB.

I need to count the number of words and characters (non-space characters).

I have written the code below.

using System;
using System.IO;

string fileName = "abc.txt";
long words = 0;
long characters = 0;
if (File.Exists(fileName))
{
    using (StreamReader sr = new StreamReader(fileName))
    {
        string text = sr.ReadToEnd();
        string[] fields = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        foreach (string str in fields)
        {
            characters += str.Length;
        }
        words += fields.LongLength;
    }

    Console.WriteLine("The word count is {0} and character count is {1}", words, characters);
}

Is there any way to make it faster using threads? Someone suggested to me that using threads would speed it up.

I have found one issue in my code: it will fail if the number of words or characters is greater than the long max value.

I have written this code assuming the file contains only English characters, but there can be non-English characters as well.

I am especially looking for thread-related suggestions.


Solution

  • Here is how you could tackle the problem of counting the non-whitespace characters of a huge text file efficiently, using parallelism. First we need a way to read blocks of characters in a streaming fashion. The native File.ReadLines method doesn't cut it, since the file may well consist of a single line. Below is a method that uses the StreamReader.ReadBlock method to grab blocks of characters of a specific size, and returns them as an IEnumerable<char[]>.

    public static IEnumerable<char[]> ReadCharBlocks(String path, int blockSize)
    {
        using (var reader = new StreamReader(path))
        {
            while (true)
            {
                var block = new char[blockSize];
                var count = reader.ReadBlock(block, 0, block.Length);
                if (count == 0) break; // end of file
                if (count < block.Length) Array.Resize(ref block, count); // trim the final block
                yield return block;
            }
        }
    }
    

    With this method in place, it is then quite easy to parallelize the parsing of the character blocks using PLINQ:

    public static long GetNonWhiteSpaceCharsCount(string filePath)
    {
        return Partitioner
            .Create(ReadCharBlocks(filePath, 10000), EnumerablePartitionerOptions.NoBuffering)
            .AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            .Select(chars => chars
                // Skipping the high surrogates ensures that a surrogate pair
                // (a single non-BMP character stored as two chars) counts as
                // one character.
                .Where(c => !Char.IsWhiteSpace(c) && !Char.IsHighSurrogate(c))
                .LongCount())
            .Sum();
    }
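
    For example, with this method in place, the character-counting part of the original program reduces to something like this (using the abc.txt file from the question):

    long characters = GetNonWhiteSpaceCharsCount("abc.txt");
    Console.WriteLine("The character count is {0}", characters);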
    

    What happens above is that multiple threads are reading the file and processing the blocks, but reading the file is synchronized. Only one thread at a time is allowed to fetch the next block, by calling the IEnumerator<char[]>.MoveNext method. This behavior does not resemble a pure producer-consumer setup, where one thread would be dedicated to reading the file, but in practice the performance characteristics should be the same. That's because this particular workload has low variability: parsing each character block should take approximately the same time, so when one thread is done reading a block, another thread should already be waiting in line to read the next one, resulting in the combined reading operation being almost continuous.
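
    For comparison, here is roughly what the pure producer-consumer setup mentioned above might look like: a dedicated reader task feeds a BlockingCollection<char[]>, and PLINQ drains it in parallel. This is a sketch meant only to illustrate the contrast, not part of the suggested solution; the bounded capacity of 4 is an arbitrary choice, and the method needs using directives for System.Collections.Concurrent, System.Linq and System.Threading.Tasks.

    public static long GetNonWhiteSpaceCharsCountProducerConsumer(string filePath)
    {
        using (var queue = new BlockingCollection<char[]>(boundedCapacity: 4))
        {
            // Producer: a single dedicated task reads the file and feeds the queue.
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var block in ReadCharBlocks(filePath, 10000))
                        queue.Add(block);
                }
                finally { queue.CompleteAdding(); }
            });

            // Consumers: PLINQ drains the queue in parallel. The NoBuffering
            // option matters here too, for the reasons explained below.
            long count = Partitioner
                .Create(queue.GetConsumingEnumerable(), EnumerablePartitionerOptions.NoBuffering)
                .AsParallel()
                .WithDegreeOfParallelism(Environment.ProcessorCount)
                .Select(chars => chars
                    .Where(c => !Char.IsWhiteSpace(c) && !Char.IsHighSurrogate(c))
                    .LongCount())
                .Sum();

            producer.Wait(); // propagate any exceptions thrown by the reader
            return count;
        }
    }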

    The Partitioner configured with NoBuffering is used so that each thread acquires one block at a time. Without it, PLINQ employs chunk partitioning, which means that each thread progressively asks for more and more elements at a time. Chunk partitioning is not suitable in this case, because the mere act of enumerating is costly.

    The worker threads are provided by the ThreadPool. The current thread also participates in the processing, so in the above example, assuming that the current thread is the application's main thread, the number of threads provided by the ThreadPool is Environment.ProcessorCount - 1.

    You may need to fine-tune the operation by adjusting the blockSize (larger is generally better) and the degree of parallelism to the capabilities of your hardware. Environment.ProcessorCount may be too many threads; 2 could probably be enough.

    The problem of counting the words is significantly more difficult, because a word may span more than one character block. It is even possible that the whole 1 GB file contains a single word. You could try to solve this problem by studying the source code of the StreamReader.ReadLine method, which has to deal with the same kind of problem. Tip: if one block ends with a non-whitespace character, and the next block starts with a non-whitespace character as well, there is certainly a word split in half there. You could keep track of the number of split-in-half words, and eventually subtract this number from the total number of words, as in the sketch below.
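
    Below is a minimal single-threaded sketch of that boundary-tracking idea, reusing the ReadCharBlocks method from above (the method name and the block size are arbitrary). Each block is processed in isolation: the words that start inside the block are counted, and then one word is subtracted for every boundary where a word is split in half.

    public static long GetWordsCount(string filePath)
    {
        long words = 0;
        bool prevBlockEndsInWord = false;
        foreach (var block in ReadCharBlocks(filePath, 10000))
        {
            // Count the words that start inside this block, treating
            // the block in isolation.
            bool inWord = false;
            foreach (var c in block)
            {
                if (!Char.IsWhiteSpace(c))
                {
                    if (!inWord) words++;
                    inWord = true;
                }
                else inWord = false;
            }
            // A word split in half across the boundary has been counted
            // twice: once in the previous block and once at the start of
            // this one, so compensate for it.
            if (prevBlockEndsInWord && !Char.IsWhiteSpace(block[0])) words--;
            prevBlockEndsInWord = !Char.IsWhiteSpace(block[block.Length - 1]);
        }
        return words;
    }

    Because each block is examined independently, the same scheme could in principle be parallelized like the character counting above: each block would yield its local word count plus the whitespace-ness of its first and last characters, and the boundary corrections would be applied when the per-block results are combined in order.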