Sparse files: How to find contents


If I create a file, use lseek(2) to jump to a high offset in the (still empty) file, and then write some valuable information there, I create a sparse file on a Unix system. (This probably depends on the file system, but let's assume a typical Unix file system such as ext4, where this is the case.)

If I then lseek(2) to an even higher offset and write something there as well, I end up with a sparse file that contains the valuable information somewhere in its middle, surrounded by huge holes. I'd like to find this valuable information without having to read the whole file.

Example:

$ python
with open('sparse', 'wb') as f:
    f.seek((1 << 40) + 42)   # 1 TiB + 42 bytes
    f.write(b'foo')          # the valuable information
    f.seek((1 << 40) * 2)    # 2 TiB
    f.write(b'\0')           # a single byte to extend the file

This creates a 2 TiB file that uses only 8 KiB of disk space:

$ du -h sparse 
8.0K    sparse
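
The same comparison between apparent size and allocated space can be made from Python. The following is only a minimal sketch: `sparse_usage` is a hypothetical helper name, and it assumes a Unix-like system where `os.stat` reports `st_blocks` in 512-byte units.

```python
import os

def sparse_usage(path):
    """Return (apparent_size, allocated_bytes) for the file at path."""
    st = os.stat(path)
    # st_size is the logical length; st_blocks counts allocated
    # 512-byte units (assumption: Unix-like platform).
    return st.st_size, st.st_blocks * 512
```

Calling `sparse_usage('sparse')` on the file above should report an apparent size of roughly 2 TiB and an allocated size close to what `du` shows, although the exact allocation depends on the file system.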

Somewhere in the middle of it (at 1 TiB + 42 bytes) is the valuable information (foo).

I can find it using cat sparse, of course, but that reads the complete file and prints immense amounts of zero bytes. Extrapolating from tests with smaller sizes, this method would take about 3 hours to print the three characters on my computer.

The question is:

Is there a way to find the information stored in a sparse file without reading all the empty blocks as well? Can I somehow find out where empty blocks are in a sparse file using standard Unix methods?


Solution

  • Just writing an answer based on the previous comments:

    #!/usr/bin/env python3
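    from errno import ENXIO
    # Use the platform's own constants (available since Python 3.3)
    # instead of hard-coding Linux's values 3 and 4.
    from os import SEEK_DATA, SEEK_HOLE, lseek
    from sys import argv, stderr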
    
    def get_ranges(fobj):
        ranges = []
        end = 0
    
        while True:
            try:
                start = lseek(fobj.fileno(), end, SEEK_DATA)
                end = lseek(fobj.fileno(), start, SEEK_HOLE)
                ranges.append((start, end))
            except OSError as e:
                if e.errno == ENXIO:
                    return ranges
    
                raise
    
    def main():
        if len(argv) < 2:
            print('Usage: %s <sparse_file>' % argv[0], file=stderr)
            raise SystemExit(1)
    
        try:
            with open(argv[1], 'rb') as f:
                ranges = get_ranges(f)
                for start, end in ranges:
                    print('[%d:%d]' % (start, end))
                    size = end-start
                    length = min(20, size)
                    f.seek(start)
                    data = f.read(length)
                    print(data)
        except OSError as e:
            print('Error:', e, file=stderr)
            raise SystemExit(1)
    
    if __name__ == '__main__': main()
    

    Note that this probably does not return exactly the data you wrote. File systems track allocation in whole blocks, so each reported range is typically block-aligned; zero bytes may surround the actual data and must be trimmed by hand.

    The current status of SEEK_DATA and SEEK_HOLE is described in the lseek(2) man page at https://man7.org/linux/man-pages/man2/lseek.2.html:

    SEEK_DATA and SEEK_HOLE are nonstandard extensions also present in Solaris, FreeBSD, and DragonFly BSD; they are proposed for inclusion in the next POSIX revision (Issue 8).
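
The manual trimming mentioned in the answer could look roughly like the sketch below. `trim_zeros` is a hypothetical helper, not part of the script above. It assumes the whole data range fits in memory, and it cannot tell padding apart from zero bytes that were deliberately written at the edges of the data.

```python
def trim_zeros(chunk, start):
    """Strip the NUL padding around the data in a block-aligned range.

    chunk is the bytes read from one data range and start is the file
    offset of chunk[0]. Returns (offset, payload), or None if the chunk
    contains only zero bytes.
    """
    payload = chunk.lstrip(b'\0')
    if not payload:
        return None
    # The number of stripped leading zeros gives the real data offset.
    offset = start + (len(chunk) - len(payload))
    return offset, payload.rstrip(b'\0')
```

Inside the loop of the script, one could read a whole range with `f.read(end - start)` and pass the result together with `start` to `trim_zeros`. For the file from the question this would be expected to yield the offset 1 TiB + 42 and `b'foo'`, provided the range starts on a block boundary at 1 TiB.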