How to improve performance when parsing large text file - StreamReader + Regex


I am developing a windows form application that takes a Robot Program generated by other software and modifies it. The process of modification is as follows:

  1. StreamReader.ReadLine() is used to parse the file line by line.
  2. Regex is used to search for specific keywords in the file. If a match is found, the matched string is copied to another string and replaced with new lines of robot code.
  3. The modified code is saved in a string and finally written to a new file.
  4. All the matched strings obtained using Regex are also collected in a string and finally written to another new file.

I have been able to do this successfully with the following code:

    private void Form1_Load(object sender, EventArgs e)
    {
        string NextLine = null;
        string CurrLine = null;
        string MoveL_Pos_Data = null;
        string MoveL_Ref_Data = null;
        string MoveLFull = null;
        string ModCode = null;
        string TAB = "\t";
        string NewLine = "\r\n";
        string SavePath = null;
        string ExtCode_1 = null;
        string ExtCode_2 = null;
        string ExtCallMod = null;

        int MatchCount = 0;
        int NumRoutines = 0;

        try
        {
            // Ask user location of the source file
            // Display an OpenFileDialog so the user can select a .MOD file.
            OpenFileDialog openFileDialog1 = new OpenFileDialog
            {
                Filter = "MOD Files|*.mod",
                Title = "Select an ABB RAPID MOD File"
            };

            // Show the Dialog.  
            // If the user clicked OK in the dialog and  
            // a .MOD file was selected, open it.  
            if (openFileDialog1.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                // Open the selected file and read it line by line
                using (StreamReader sr = new StreamReader(openFileDialog1.FileName))
                {
                    // define a regular expression to search for extr calls 
                    Regex Extr_Ex = new Regex(@"\bExtr\(-?\d*.\d*\);", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);
                    Regex MoveL_Ex = new Regex(@"\bMoveL\s+(.*)(z\d.*)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);

                    Match MoveLString = null;

                    while (sr.Peek() >= 0)
                    {
                        CurrLine = sr.ReadLine();
                        //Console.WriteLine(sr.ReadLine());

                        // check if the line is a match 
                        if (Extr_Ex.IsMatch(CurrLine))
                        {
                            // Keep a count for total matches
                            MatchCount++;

                            // Save extr calls in a string
                            ExtCode_1 += NewLine + TAB + TAB + Extr_Ex.Match(CurrLine).ToString();


                            // Read next line (always a MoveL) to get Pos data for TriggL
                            NextLine = sr.ReadLine();
                            //Console.WriteLine(NextLine);

                            if (MoveL_Ex.IsMatch(NextLine))
                            {
                                // Next Line contains MoveL
                                // get matched string 
                                MoveLString = MoveL_Ex.Match(NextLine);
                                GroupCollection group = MoveLString.Groups;
                                MoveL_Pos_Data = group[1].Value.ToString();
                                MoveL_Ref_Data = group[2].Value.ToString();
                                MoveLFull = MoveL_Pos_Data + MoveL_Ref_Data;                                

                            }

                            // replace Extr with following commands
                            ModCode += NewLine + TAB + TAB + "TriggL " + MoveL_Pos_Data + "extr," + MoveL_Ref_Data;
                            ModCode += NewLine + TAB + TAB + "WaitDI DI1_1,1;";
                            ModCode += NewLine + TAB + TAB + "MoveL " + MoveLFull;
                            ModCode += NewLine + TAB + TAB + "Reset DO1_1;";
                            //break;

                        }
                        else
                        {
                            // No extr Match
                            ModCode += "\r\n" + CurrLine;
                        }                     

                    }

                    Console.WriteLine($"Total Matches: {MatchCount}");
                }


            }

            // Write modified code into a new output file
            string SaveDirectoryPath = Path.GetDirectoryName(openFileDialog1.FileName);
            string ModName = Path.GetFileNameWithoutExtension(openFileDialog1.FileName);
            SavePath = SaveDirectoryPath + @"\" + ModName + "_rev.mod";
            File.WriteAllText(SavePath, ModCode);

            //Write Extr matches into new output file 
            //Prepare module
            ExtCallMod = "MODULE ExtruderCalls";

            // All extr calls in one routine
            //Prepare routines
            ExtCallMod += NewLine + NewLine + TAB + "PROC Prg_ExtCall"; // + 1;
                ExtCallMod += ExtCode_1;
                ExtCallMod += NewLine + NewLine + TAB + "ENDPROC";
                ExtCallMod += NewLine + NewLine;

            //}

            ExtCallMod += "ENDMODULE";

            // Write to file
            string ExtCallSavePath = SaveDirectoryPath + @"\ExtrCalls.mod";                
            File.WriteAllText(ExtCallSavePath, ExtCallMod);                

        }

        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());                
        }

    }                    
}

While this helps me achieve what I want, the process is very slow. Since I am new to C# programming, I suspect that the slowness is coming from duplicating the original file contents to a string and NOT replacing content in place (I am not sure if contents in original file can be directly replaced). For an input file of 20,000 rows, the whole process is taking a little over 5 minutes.

I used to get the following error: Message=Managed Debugging Assistant 'ContextSwitchDeadlock' : 'The CLR has been unable to transition from COM context 0xb27138 to COM context 0xb27080 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.'

I was able to get past it by disabling 'ContextSwitchDeadlock' settings in debugger settings. This may not be the best practice.

Can anyone help me in improving the performance of my code?

EDIT: I found out that the robot controller limits the number of lines in the MOD file (the output file) to a maximum of 32768. I came up with the following logic to split the contents of the StringBuilder into separate output files:

        // Split modCodeBuilder into separate batches of lines based on the final size
        const int maxSize = 32500;
        string result = modCodeBuilder.ToString();
        string[] splitResult = result.Split(new string[] { "\r\n" }, StringSplitOptions.None);

        // Setup destination directory to be same as source directory
        string destDir = Path.GetDirectoryName(fileNames[0]);

        for (int count = 0; ; count++)
        {
            // Name the module and routine after the batch number
            string modName = $"PrgMOD_{count + 1}";
            string procName = $"Prg_{count + 1}()";

            // Index of the first line of this batch in splitResult
            int src_start_index = count * maxSize;

            if (src_start_index >= splitResult.Length) break; // Exit loop when there are no lines left to take

            // Take a full batch, or only the remaining lines for the last batch,
            // so that the end of splitResult is never exceeded
            int dataLength = Math.Min(maxSize, splitResult.Length - src_start_index);

            // Join only this batch's lines; no intermediate array is needed
            string finalModCode = String.Join("\r\n", splitResult, src_start_index, dataLength);

            string batch = String.Concat("MODULE ", modName, "\r\n\r\n\tPROC ", procName, "\r\n", finalModCode, "\r\n\r\n\tENDPROC\r\n\r\nENDMODULE");

            //if (batch.Length == 0) break; 

            // Generate file name based on count
            string fileName = $"ABB_R3DP_{count + 1}.mod";

            // Write our file text
            File.WriteAllText(Path.Combine(destDir, fileName), batch);

            // Write status to output textbox
            TxtOutput.AppendText("\r\n");
            TxtOutput.AppendText("\r\n");
            TxtOutput.AppendText($"Modified MOD File: {fileName} is generated successfully! It is saved to location: {Path.Combine(destDir, fileName)}");
        }

Solution

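  • The repeated string concatenations are the likely bottleneck. Strings are immutable in .NET, so every += allocates a new string and copies everything accumulated so far, which makes the cost of the loop grow roughly quadratically with the size of the file. Using a StringBuilder instead should improve your performance: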

    private static void GenerateNewFile(string sourceFullPath)
    {
        string posData = null;
        string refData = null;
        string fullData = null;
    
        var modCodeBuilder = new StringBuilder();
        var extCodeBuilder = new StringBuilder();
    
        var extrRegex = new Regex(@"\bExtr\(-?\d*.\d*\);", RegexOptions.Compiled | 
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
    
        var moveLRegex = new Regex(@"\bMoveL\s+(.*)(z\d.*)", RegexOptions.Compiled | 
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
    
        int matchCount = 0;
        bool appendModCodeNext = false;
    
        foreach (var line in File.ReadLines(sourceFullPath))
        {
            if (appendModCodeNext)
            {
                if (moveLRegex.IsMatch(line))
                {
                    GroupCollection group = moveLRegex.Match(line).Groups;
    
                    if (group.Count > 2)
                    {
                        posData = group[1].Value;
                        refData = group[2].Value;
                        fullData = posData + refData;
                    }
                }
    
                modCodeBuilder.Append("\t\tTriggL ").Append(posData).Append("extr,")
                    .Append(refData).Append("\r\n\t\tWaitDI DI1_1,1;\r\n\t\tMoveL ")
                    .Append(fullData).AppendLine("\r\n\t\tReset DO1_1;");
    
                appendModCodeNext = false;
            }
            else if (extrRegex.IsMatch(line))
            {
                matchCount++;
                extCodeBuilder.Append("\t\t").AppendLine(extrRegex.Match(line).ToString());
                appendModCodeNext = true;
            }
            else
            {
                modCodeBuilder.AppendLine(line);
            }
        }
    
        Console.WriteLine($"Total Matches: {matchCount}");
    
        string destDir = Path.GetDirectoryName(sourceFullPath);
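        // Append the suffix to the file name itself; passing "_rev.mod" as a separate
        // Path.Combine argument would create a sub-directory named after the source file
        var savePath = Path.Combine(destDir, 
            Path.GetFileNameWithoutExtension(sourceFullPath) + "_rev.mod");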
    
        File.WriteAllText(savePath, modCodeBuilder.ToString());
    
        // The line break after the PROC name keeps the first Extr call on its own line
        var extCallMod = string.Concat("MODULE ExtruderCalls\r\n\r\n\tPROC Prg_ExtCall\r\n",
            extCodeBuilder.ToString(), "\r\n\r\n\tENDPROC\r\n\r\nENDMODULE");
    
        File.WriteAllText(Path.Combine(destDir, "ExtrCalls.mod"), extCallMod);
    }
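
    To see the difference in isolation, here is a minimal sketch that builds the same text both ways. The class and method names (ConcatComparison, BuildWithConcat, BuildWithStringBuilder, TimeMs) are made up for this illustration and are not part of the code above. Timings depend on the machine and on whether a debugger is attached, so none are quoted here.

    ```csharp
    using System;
    using System.Diagnostics;
    using System.Text;

    public static class ConcatComparison
    {
        // Mirrors the pattern from the question: += inside a loop.
        // Each pass allocates a new string and copies the old contents into it.
        public static string BuildWithConcat(int lineCount)
        {
            string result = null;
            for (int i = 0; i < lineCount; i++)
            {
                result += "\r\n" + "line " + i;
            }
            return result ?? string.Empty;
        }

        // Same output, but appended into a growable buffer
        // and converted to a string only once at the end.
        public static string BuildWithStringBuilder(int lineCount)
        {
            var sb = new StringBuilder();
            for (int i = 0; i < lineCount; i++)
            {
                sb.Append("\r\n").Append("line ").Append(i);
            }
            return sb.ToString();
        }

        // Hypothetical helper: returns the elapsed milliseconds for one build.
        public static long TimeMs(Func<int, string> build, int lineCount)
        {
            var sw = Stopwatch.StartNew();
            build(lineCount);
            sw.Stop();
            return sw.ElapsedMilliseconds;
        }
    }
    ```

    Calling ConcatComparison.TimeMs(ConcatComparison.BuildWithConcat, 20000) and then the same with BuildWithStringBuilder should give a rough feel for the gap on your own machine.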
    

    You mentioned in the comments that you want to take batches of the text and write them to separate files. One way to do this is to treat the string as a sequence of characters and use the System.Linq extension methods Skip and Take. Skip skips a given number of characters in the string, and Take then takes a given number of characters and returns them as an IEnumerable<char>. We can then use string.Concat to convert this back to a string and write it to a file.

    If we have a constant that represents our max size, and a counter that starts at 0, we can use a for loop that increments counter and which skips counter * max characters, and then takes max characters from the string. We can also use the counter variable to create the file name, since it will increment on each iteration:

    const int maxSize = 32500;
    string result = modCodeBuilder.ToString();
    
    for (int count = 0;; count++)
    {
        // Get the next batch of text by skipping the amount
        // we've taken so far and then taking the maxSize.
        string batch = string.Concat(result.Skip(count * maxSize).Take(maxSize));
    
        if (batch.Length == 0) break; // Exit loop when there's no text left to take
    
        // Generate file name based on count
        string fileName = $"filename_{count + 1}.mod";
    
        // Write our file text
        File.WriteAllText(Path.Combine(destDir, fileName), batch);
    }
    

    Another way to do this, which is likely faster because Substring copies the requested range directly instead of enumerating one character at a time, is to use string.Substring with count * maxSize as the start index. Then we just need to make sure our length doesn't exceed the bounds of the string, and write the substring to the file:

    for (int count = 0;; count++)
    {
        // Get the bounds for the substring (startIndex and length)
        var startIndex = count * maxSize;
        var length = Math.Min(result.Length - startIndex, maxSize);
    
        if (length < 1) break; // Exit loop when there's no text left to take
    
        // Get the substring and file name
        var batch = result.Substring(startIndex, length);
        string fileName = $"filename_{count + 1}.mod";
    
        // Write our file text  
        File.WriteAllText(Path.Combine(destDir, fileName), batch);
    }
    

    Note that this will split the text into blocks of exactly 32500 characters (except the last block). If you want to take only whole lines, that requires a bit more work but is still not hard to do.
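
    For the whole-line case, one possible sketch is to split the text into lines first and then group the lines into batches. ModSplitter, BatchLines and WriteBatches are hypothetical names chosen for this example. The MODULE/PROC wrapper follows the format shown in your edit, and the sketch assumes that maxLines leaves room for the few wrapper lines under the controller's limit.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public static class ModSplitter
    {
        // Groups lines into batches of at most maxLines lines each.
        // Only the last batch can be shorter than maxLines.
        public static IEnumerable<string[]> BatchLines(IEnumerable<string> lines, int maxLines)
        {
            if (maxLines < 1) throw new ArgumentOutOfRangeException(nameof(maxLines));

            var batch = new List<string>();
            foreach (var line in lines)
            {
                batch.Add(line);
                if (batch.Count == maxLines)
                {
                    yield return batch.ToArray();
                    batch.Clear();
                }
            }

            if (batch.Count > 0) yield return batch.ToArray();
        }

        // Writes every batch to its own file and returns the paths written.
        public static List<string> WriteBatches(string text, string destDir, int maxLines)
        {
            var written = new List<string>();
            string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
            int count = 0;

            foreach (string[] batch in BatchLines(lines, maxLines))
            {
                count++;

                // Wrap the batch in its own module and routine (format assumed from the question)
                string module = string.Concat(
                    "MODULE PrgMOD_", count.ToString(), "\r\n\r\n\tPROC Prg_", count.ToString(), "()\r\n",
                    string.Join("\r\n", batch),
                    "\r\n\r\n\tENDPROC\r\n\r\nENDMODULE");

                string path = Path.Combine(destDir, $"filename_{count}.mod");
                File.WriteAllText(path, module);
                written.Add(path);
            }

            return written;
        }
    }
    ```

    With the variables from the snippets above, the call would be ModSplitter.WriteBatches(result, destDir, maxSize). Because the batches are cut on line boundaries, no statement is ever split across two files.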