I have a text file with over 5 million lines. I need to go through it line by line, remove certain lines, and replace a certain string. I wrote something in C# that 'works', but it can take almost a day to complete, which seems insane given that a search and replace in Notepad++ does the same job in minutes. We need to automate this, however.
The file(s) can contain, at arbitrary positions, lines such as:
"<-comment 1: (more text on the line here)"
"<-Another line (more text on the line here)"
I want to remove any line that starts with "<-comment 1:" or "<-Another line".
There is also a string, "<tag>—</tag>", which I want to replace with an underscore. It should only appear on lines that start with "LINK:".
The code I have so far is:
    static void Main()
    {
        const Int32 BufferSize = 128;
        int count = 0;   // lines read
        int count2 = 0;  // lines written back
        string filename = @"C:\test\test.txt";
        string output = @"C:\text\output.txt";
        string Startcomment = @"<-comment 1:";
        string Startmoretext = @"<-Another line";
        string othercit = @"LINK:";
        string sub = @"<tag>—</tag>";
        string subreplace = @"_";
        string line;
        using (var filestream = File.OpenRead(filename))
        {
            Console.WriteLine("Start time: " + DateTime.Now.ToString());
            using (var streamreader = new StreamReader(filestream, Encoding.UTF8, true, BufferSize))
            {
                File.WriteAllText(output, "Clean text file" + Environment.NewLine);
                while ((line = streamreader.ReadLine()) != null)
                {
                    count++;
                    if (count % 10000 == 0)
                    {
                        Console.WriteLine("Batch complete: " + DateTime.Now.ToString());
                    }
                    // Keep only lines that do not start with one of the unwanted prefixes.
                    if (!line.StartsWith(Startcomment) && !line.StartsWith(Startmoretext))
                    {
                        count2++;
                        if (line.StartsWith(othercit))
                        {
                            line = line.Replace(sub, subreplace);
                        }
                        File.AppendAllText(output, line + Environment.NewLine);
                    }
                }
            }
        }
        Console.WriteLine(count + " Lines processed");
        Console.WriteLine(count2 + " Lines written back");
    }
The run time is just not viable.
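One likely contributor to the run time is that `File.AppendAllText` opens the output file, appends to it, and closes it again on every call, so the loop above does that once per written line. Below is a minimal sketch of the same filter with a single writer kept open for the whole run. The class name `LineCleaner`, the method `Clean`, and the switch to ordinal comparisons are assumptions made for illustration and are not part of the code above.

```csharp
using System;
using System.IO;

static class LineCleaner
{
    // Prefixes and strings taken from the question.
    const string StartComment = "<-comment 1:";
    const string StartMoreText = "<-Another line";
    const string LinkPrefix = "LINK:";
    const string Sub = "<tag>—</tag>";
    const string SubReplace = "_";

    // Copies lines from reader to writer, dropping the unwanted lines and
    // replacing Sub on lines that start with LINK:. Returns the number of
    // lines written. The caller owns (and disposes) reader and writer.
    public static int Clean(TextReader reader, TextWriter writer)
    {
        int written = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Ordinal comparison avoids culture-sensitive matching
            // (assumption: the prefixes are meant to match exactly).
            if (line.StartsWith(StartComment, StringComparison.Ordinal) ||
                line.StartsWith(StartMoreText, StringComparison.Ordinal))
            {
                continue; // line is removed
            }
            if (line.StartsWith(LinkPrefix, StringComparison.Ordinal))
            {
                line = line.Replace(Sub, SubReplace);
            }
            writer.WriteLine(line);
            written++;
        }
        return written;
    }
}
```

It could be called with a `StreamReader` on the input file and a `StreamWriter` on the output file, both wrapped in `using` so that the writer is flushed and closed once, at the end of the run.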
I also wanted to drive this with regular expressions read from a config file that we could maintain outside the script, should we need to add new exceptions, but that version also seems to run forever.
    static void Main()
    {
        const Int32 BufferSize = 128;
        string filename = @"C:\test\test.txt";
        XmlDocument xdoc = new XmlDocument();
        // NOTE: the call that loads the config file into xdoc is not shown here; without it DocumentElement is null.
        XmlElement xmlRoot = xdoc.DocumentElement;
        XmlNodeList xmlNodes = xmlRoot.SelectNodes("/root/line");
        int count = 0;
        string line;
        using (var filestream = File.OpenRead(filename))
        using (var streamreader = new StreamReader(filestream, Encoding.UTF8, true, BufferSize))
        {
            File.WriteAllText(@"C:\test\output.txt", "Clean file" + Environment.NewLine);
            while ((line = streamreader.ReadLine()) != null)
            {
                string output = line;
                // Apply every pattern/replacement pair from the config to the line.
                foreach (XmlNode node in xmlNodes)
                {
                    string pattern = node["pattern"].InnerText;
                    string replacement = node["replacement"].InnerText;
                    Regex rgx = new Regex(pattern);
                    output = rgx.Replace(output, replacement);
                    rgx = null;
                }
                if (output.Length > 0)
                {
                    count++;
                    if (count % 10000 == 0)
                    {
                        Console.WriteLine("Batch complete: " + DateTime.Now.ToString()); // progress report as in the first version (assumed)
                    }
                    // Append to the output file; appending to the input file, which is still open for reading, would fail.
                    File.AppendAllText(@"C:\test\output.txt", output + Environment.NewLine);
                }
            }
        }
    }
XML config file:
    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <line><pattern><![CDATA[<-comment 1:.*]]></pattern><replacement></replacement></line>
        <line><pattern><![CDATA[<-Another line.*]]></pattern><replacement></replacement></line>
    </root>
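Since the code reads `/root/line` nodes with `pattern` and `replacement` children, the config could be turned into ready-to-use `Regex` objects once, before the file is read, rather than building a new `Regex` for every rule on every line. The following is only a sketch under that assumption: `RuleLoader` and `Load` are hypothetical names, the config is passed in as a string for brevity, and `RegexOptions.Compiled` is an optional choice whose benefit would need measuring.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

static class RuleLoader
{
    // Parses the config XML and builds one Regex per /root/line node.
    // Each entry pairs the prepared Regex with its replacement text.
    public static List<KeyValuePair<Regex, string>> Load(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var rules = new List<KeyValuePair<Regex, string>>();
        foreach (XmlNode node in doc.SelectNodes("/root/line"))
        {
            string pattern = node["pattern"].InnerText;
            string replacement = node["replacement"].InnerText;
            // The Regex is constructed here, once, not inside the read loop.
            var regex = new Regex(pattern, RegexOptions.Compiled);
            rules.Add(new KeyValuePair<Regex, string>(regex, replacement));
        }
        return rules;
    }
}
```

The returned list can then be reused for every line of the input, so the cost of parsing each pattern is paid once per rule instead of once per rule per line.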
What is the most efficient way to do something like this?
I think the following works more efficiently; it partially follows what @C.Evenhuis recommends:
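    // path, xmlNodes and count are declared as in the previous listing.
    using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    using (StreamWriter writer = new StreamWriter(@"C:\test\test.txt")) // verbatim string, so "\t" is not read as a tab
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string output = line;
            foreach (XmlNode node in xmlNodes)
            {
                string pattern = node["pattern"].InnerText;
                string replacement = node["replacement"].InnerText;
                Regex rgx = new Regex(pattern);
                output = rgx.Replace(output, replacement);
                rgx = null;
            }
            if (output.Length > 0)
            {
                count++;
                if (count % 10000 == 0)
                {
                    Console.WriteLine("Batch complete: " + DateTime.Now.ToString()); // progress report (assumed)
                }
                writer.WriteLine(output); // the writer stays open for the whole run
            }
        }
    }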
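The loop above still constructs a new `Regex` for every rule on every line. A possible next step, sketched here with hypothetical names (`RegexCleaner`, `Clean`) and assuming the rules have already been prepared as `Regex`/replacement pairs, is to build the `Regex` objects once and only apply them inside the loop:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class RegexCleaner
{
    // Applies every prepared rule to every line and writes the non-empty
    // results. Returns the number of lines written. The caller owns
    // (and disposes) reader and writer.
    public static int Clean(
        TextReader reader,
        TextWriter writer,
        IEnumerable<KeyValuePair<Regex, string>> rules)
    {
        int written = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string output = line;
            foreach (KeyValuePair<Regex, string> rule in rules)
            {
                // rule.Key is the prepared Regex, rule.Value its replacement.
                output = rule.Key.Replace(output, rule.Value);
            }
            if (output.Length > 0)
            {
                writer.WriteLine(output);
                written++;
            }
        }
        return written;
    }
}
```

As in the listing above, a line that a rule reduces to the empty string is dropped. The patterns from the config are not anchored, so a rule would also strip a match that begins in the middle of a line; prefixing the patterns with `^` would restrict them to the start of the line, if that is the intent.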