I'm not asking only about reading a large file, or about reading/writing an XML file, for which I know there are XML-related classes. Let me give a more specific description of what I'm trying to do:
I have a very large file, about 10TB, which I cannot load into memory at once. That means I can't do something like this:
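var lines = File.ReadAllLines("LargeFile.txt"); // would need the whole file in memory
var t = 1L << 40; // long literal: with int operands, 1 << 40 evaluates to 256 in C#
for (var i = t; i < 2 * t; i++)
{
    lines[i] = someWork();
}
File.WriteAllLines("LargeFile.txt", lines);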
I want to read and update only the lines in the range between 1TB and 2TB.
What's the best approach for doing this? Examples of .NET classes or third-party libraries would be helpful. I'm also interested in how other languages handle this problem.
I tried David's suggestion of setting the stream position, but I don't think it solves my problem. 1. The size of the FileStream seems fixed: I can modify bytes, but a write overwrites the existing content byte by byte. If my new data is longer or shorter than the original line, I can't update the file correctly. 2. I didn't find an O(1) way to convert a line number to a byte position; it still takes O(N) to find the position.
Below is my attempt:
using System.IO;
using System.Linq;
using System.Text;

public static void ReadWrite()
{
    var fn = "LargeFile.txt";
    File.WriteAllLines(fn, Enumerable.Range(1, 20).Select(x => x.ToString()));
    var targetLine = 11; // zero-based
    long pos = -1;
    using (var fs = new FileStream(fn, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        // Count newlines until the target line starts, then stop.
        // This still takes O(N) time on average to find the position.
        // I'm not sure if there is a better way to jump to line x in O(1) time.
        var linesToSkip = targetLine;
        int b;
        while (linesToSkip > 0 && (b = fs.ReadByte()) != -1)
        {
            if (b == '\n')
            {
                linesToSkip--;
            }
        }
        if (linesToSkip == 0)
        {
            pos = fs.Position; // first byte of the target line
        }
    }
    if (pos < 0)
    {
        return; // the file has fewer lines than targetLine
    }
    using (var fs = new FileStream(fn, FileMode.Open, FileAccess.ReadWrite))
    {
        var data = Encoding.UTF8.GetBytes("999");
        fs.Position = pos;
        // If the new data has a different size from the current line,
        // it will overwrite the bytes of the following lines.
        fs.Write(data, 0, data.Length);
    }
}
You don't have to read through the first 1TB to modify the middle of the file. FileStream supports random access. For example:
string fn = @"c:\temp\huge.dat";
using (var fs = new FileStream(fn, FileMode.Open, FileAccess.Read, FileShare.Read))
{
    // Jump straight to a byte offset (here 1 GiB) without reading what precedes it.
    fs.Position = (1024L * 1024L * 1024L);
    //. . .
}
Once you reposition the FileStream, you can read and write at the current location (writing requires opening the stream with write access, such as FileAccess.ReadWrite), or open a StreamReader to read text from the file. You must, of course, ensure that you move to a byte offset that begins a character in the file's encoding.
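Regarding the two concerns raised in the question's follow-up, here is a minimal sketch of one possible workaround, not a definitive implementation. It assumes UTF-8 text with '\n' line terminators. The class LineIndexSketch and its methods BuildLineOffsets, ReadLineAt and OverwriteAt are hypothetical names invented for this illustration. The idea is to pay for the O(N) scan once, record where each line starts, and then seek directly to any line. The in-place overwrite deliberately rejects data of a different length, because a FileStream write replaces bytes and cannot insert or remove them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineIndexSketch
{
    // One sequential pass that records the byte offset at which each line starts.
    // Assumption: every line ends with '\n' (a preceding '\r' is left as line content).
    public static List<long> BuildLineOffsets(string path)
    {
        var offsets = new List<long> { 0 };
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            var buffer = new byte[1 << 16];
            long basePos = 0;
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] == (byte)'\n')
                    {
                        offsets.Add(basePos + i + 1);
                    }
                }
                basePos += read;
            }
            // If the file ends with a newline, no line starts at end-of-file.
            if (offsets.Count > 1 && offsets[offsets.Count - 1] == fs.Length)
            {
                offsets.RemoveAt(offsets.Count - 1);
            }
        }
        return offsets;
    }

    // Seeks to a known line start and reads a single line as text.
    public static string ReadLineAt(string path, long offset)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var reader = new StreamReader(fs, Encoding.UTF8))
        {
            fs.Position = offset; // set before the reader performs its first read
            return reader.ReadLine();
        }
    }

    // Overwrites bytes in place. The replacement must encode to exactly
    // expectedByteLength bytes, otherwise neighbouring data would be damaged.
    public static void OverwriteAt(string path, long offset, string text, int expectedByteLength)
    {
        var data = Encoding.UTF8.GetBytes(text);
        if (data.Length != expectedByteLength)
        {
            throw new ArgumentException("Replacement must have the same byte length as the original.");
        }
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.None))
        {
            fs.Position = offset;
            fs.Write(data, 0, data.Length);
        }
    }
}
```

A call sequence might look like: build the offsets once with BuildLineOffsets, read a line with ReadLineAt(path, offsets[11]), then replace it with OverwriteAt using a string of the same byte length (padding with spaces is one option if the format tolerates it). For a file with billions of lines the in-memory list would itself be large, so the offsets would probably need to live in a side file or be sampled every k lines; that part is left out of the sketch. If lines must be able to grow or shrink, seeking alone is not enough, and the data after the edit point has to be rewritten or the file has to use fixed-width records.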