I have a very large file (several GB) in AWS S3, and I only need a small number of lines from the file which satisfy a certain condition. I don't want to load the entire file into memory and then search for and print those few lines - the memory load for this would be too high. The right way would be to load into memory only the lines that are needed.
As per the AWS documentation, to read from the file:
S3Object fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
displayTextInputStream(fullObject.getObjectContent());
private static void displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and display each line.
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
    System.out.println();
}
Here we are using a BufferedReader. It is not clear to me what is happening underneath.
Are we making a network call to S3 each time we are reading a new line, and only keeping the current line in the buffer? Or is the entire file loaded in-memory and then read line-by-line by BufferedReader? Or is it somewhere in between?
Part of the answer to your question is already given in the documentation you linked:
Your network connection remains open until you read all of the data or close the input stream.
A BufferedReader doesn't know where the data it reads is coming from, because you're passing another Reader to it. A BufferedReader creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before it starts handing out data for calls to read() or read(char[] buf).
The Reader you pass to the BufferedReader is - by the way - using another buffer for itself to do the conversion from a byte-based stream to a char-based reader. It works the same way as with BufferedReader, so the internal buffer is filled by reading from the passed InputStream, which is the InputStream returned by your S3 client.
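To make the layering visible, here is a minimal sketch of the same setup with the charset and buffer size spelled out explicitly (both values are assumptions on my part - the AWS sample simply relies on the defaults):

InputStream s3Stream = fullObject.getObjectContent();                      // raw bytes from S3
Reader decoder = new InputStreamReader(s3Stream, StandardCharsets.UTF_8);  // byte -> char, with its own internal buffer
BufferedReader reader = new BufferedReader(decoder, 8192);                 // buffers 8192 chars at a time

String line;
while ((line = reader.readLine()) != null) {
    // readLine() is served from the buffer; the underlying stream is only
    // read again once the buffer runs empty
    System.out.println(line);
}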
What exactly happens within this client when you read from the stream is implementation-dependent. One possibility is that a single network connection is kept open and you read from it as you wish; another is that the connection is closed after a chunk of data has been read and a new one is opened when you try to get the next chunk.
The documentation quoted above seems to say that we've got the former situation here, so: no, calls to readLine do not each trigger a network call.
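That also means you can stop early once you've found the lines you need, because the connection only stays open until you close (or abort) the stream. A hedged sketch for your use case (the condition, the limit of 10 matches and the variable names are placeholders; it assumes the v1 AWS SDK for Java, where getObjectContent() returns an S3ObjectInputStream whose abort() discards the rest of the response instead of draining it):

S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
S3ObjectInputStream content = object.getObjectContent();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(content, StandardCharsets.UTF_8))) {
    String line;
    int matches = 0;
    while ((line = reader.readLine()) != null) {
        if (line.contains("ERROR")) {      // placeholder for "satisfies a certain condition"
            System.out.println(line);
            if (++matches == 10) {         // enough lines found, stop reading
                content.abort();           // give up the remaining HTTP response
                break;
            }
        }
    }
}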
And to answer your other question: No, a BufferedReader, the InputStreamReader, and most likely the InputStream returned by the S3 client are not loading the whole document into memory. That would contradict the whole purpose of using streams in the first place, and the S3 client could simply return a byte[][] instead (to get around the limit of 2^31 - 1 bytes per byte-array).
Edit: There is an exception to the last paragraph. If the whole gigabytes-big document has no line breaks at all, calling readLine will actually lead to reading the whole data into memory (and most likely to an OutOfMemoryError). I assumed a "regular" text document while answering your question.
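For that case, a hedged workaround is to read fixed-size character chunks instead of lines, so only one chunk is held in memory at a time (the chunk size is arbitrary, and a match that spans a chunk boundary would need extra handling not shown here):

char[] chunk = new char[8192];
int read;
while ((read = reader.read(chunk, 0, chunk.length)) != -1) {
    String piece = new String(chunk, 0, read);
    // inspect 'piece' here; memory use stays bounded by the chunk size
}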