I am playing around with mmap to open large files. Assume that the first line of the file is a count of observations. Each observation spans 2 lines:
Observation ID \n
Variable length number of integers
I am doing some computation on these and would like to use multiprocessing. Is it possible to use seek() to seek to a line instead of a byte offset?
Clearly, this is easily done using the open method from file, but since I'm playing with mmap, I wonder if it is possible in this context.
Files are streams of bytes, not lines. If you need random access to the start of a particular line in the file, there is no way to know a priori at what offset into the file you will find it. This is true whether you're doing random access via mmap(), pread(), seek(), or any other method.
The only way to solve this problem is to build a mapping between line numbers and byte offsets. This usually means you have to scan through the entire file sequentially once.
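As a rough sketch of that one-time scan in Python: the helper names below (build_line_index, read_line, read_observation) are made up for illustration, and the layout is assumed to be the one described in the question (a count on line 0, then an ID line and a values line per observation). The start argument is passed to find() explicitly so the search does not depend on the map's current position.

```python
import mmap


def build_line_index(mm):
    """Scan once and return a list whose entry i is the byte offset where line i starts."""
    size = len(mm)
    offsets = [0] if size else []
    pos = mm.find(b"\n", 0)
    while pos != -1:
        start = pos + 1
        if start < size:  # ignore the empty "line" after a trailing newline
            offsets.append(start)
        pos = mm.find(b"\n", start)
    return offsets


def read_line(mm, offsets, lineno):
    """Return line number `lineno` (0-based) as bytes, without its newline."""
    start = offsets[lineno]
    end = mm.find(b"\n", start)
    if end == -1:  # last line has no trailing newline
        end = len(mm)
    return mm[start:end]


def read_observation(mm, offsets, k):
    """Return (id_line, values_line) for observation k (0-based).

    Assumes line 0 holds the count and each observation spans two lines.
    """
    return read_line(mm, offsets, 1 + 2 * k), read_line(mm, offsets, 2 + 2 * k)


if __name__ == "__main__":
    data = b"2\nobs1\n1 2 3\nobs2\n4 5\n"
    mm = mmap.mmap(-1, len(data))  # anonymous map standing in for a real file
    mm.write(data)
    index = build_line_index(mm)
    print(read_observation(mm, index, 1))
    mm.close()
```

Once the list of offsets exists, it is small compared to the file itself, so one plausible way to combine it with multiprocessing is to hand each worker a slice of the index and let each worker open its own map of the file.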
Depending on your specific need, other approaches might be applicable. For example, if reaching a target line number approximately is good enough and you have an idea of the average length of a line in the file, perhaps you can seek to the desired line number times the average line length, and use whatever line you find at that position. Alternatively, if your observation IDs are all in numerical order, you can binary search through the file using byte offsets until you find the line you need.
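The approximate approach might look something like the following. This is only a sketch: snap_to_line_start and line_near are hypothetical names, avg_line_len is an estimate you would have to supply yourself, and the line returned is whichever one starts at or after the guessed offset, not necessarily the line number you asked for.

```python
def snap_to_line_start(mm, pos):
    """Return the offset of the first line start at or after byte `pos`.

    Returns len(mm) if no line starts at or after `pos`.
    """
    if pos <= 0:
        return 0
    # A line starts right after a newline, so search from the byte before pos.
    nl = mm.find(b"\n", pos - 1)
    return len(mm) if nl == -1 else nl + 1


def line_near(mm, lineno, avg_line_len):
    """Guess where line `lineno` is and return the line found there, or None."""
    guess = int(lineno * avg_line_len)
    start = snap_to_line_start(mm, guess)
    if start >= len(mm):
        return None  # the guess landed past the last line start
    end = mm.find(b"\n", start)
    if end == -1:  # last line has no trailing newline
        end = len(mm)
    return mm[start:end]
```

Both functions only rely on find(), len() and slicing, so they work on an mmap object or on plain bytes. A binary search over byte offsets could reuse the same snapping step, but with the two-line format it would also need some way to tell an ID line from a values line after landing at an arbitrary offset.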