Tags: winapi, ntfs, ntfs-mft, defragmentation

Can LockFileEx be used with Volume Handles?


I'm experimenting with FSCTL_MOVE_FILE. Mostly, everything works as expected. However, sometimes when I re-read (via FSCTL_GET_NTFS_FILE_RECORD) the Mft record for the file I just moved, I get back bad data.
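
For context, the move itself is a single DeviceIoControl call against the volume handle. A simplified sketch (the cluster math and error handling are elided, and the function and variable names are mine):

#include <windows.h>
#include <winioctl.h>

// Move 'clusterCount' clusters of hFile, starting at virtual cluster
// 'startVcn', to the free space at logical cluster 'targetLcn'.
// Passing no OVERLAPPED makes this the synchronous form of the call.
BOOL MoveClusters(HANDLE hVolume, HANDLE hFile,
                  LONGLONG startVcn, LONGLONG targetLcn, DWORD clusterCount)
{
    MOVE_FILE_DATA mfd = {};
    mfd.FileHandle = hFile;
    mfd.StartingVcn.QuadPart = startVcn;
    mfd.StartingLcn.QuadPart = targetLcn;
    mfd.ClusterCount = clusterCount;

    DWORD bytes = 0;
    return DeviceIoControl(hVolume, FSCTL_MOVE_FILE,
                           &mfd, sizeof(mfd), NULL, 0, &bytes, NULL);
}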

Specifically, if the file record says the $ATTRIBUTE_LIST attribute is non-resident and I use my volume handle to read the data from the disk, I find that the data there is internally inconsistent (record length is greater than the actual length of data).
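
(By "read the data from the disk" I mean seeking the volume handle to the run's LCN and reading whole clusters. A sketch, assuming the LCN has already been extracted from the mapping pairs; the names here are mine:)

// Read 'clusterCount' raw clusters starting at logical cluster 'lcn'.
// Volume reads must be sector-aligned; reading whole clusters satisfies
// that. bytesPerCluster comes from FSCTL_GET_NTFS_VOLUME_DATA.
BOOL ReadClusters(HANDLE hVolume, LONGLONG lcn, DWORD clusterCount,
                  DWORD bytesPerCluster, BYTE *buffer)
{
    LARGE_INTEGER pos;
    pos.QuadPart = lcn * bytesPerCluster;
    if (!SetFilePointerEx(hVolume, pos, NULL, FILE_BEGIN))
        return FALSE;

    DWORD want = clusterCount * bytesPerCluster;
    DWORD got = 0;
    return ReadFile(hVolume, buffer, want, &got, NULL) && got == want;
}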

As soon as I saw this happening, the cause was pretty clear: I'm reading the record before the Ntfs driver is finished writing it. Debugging supports this theory. But knowing that doesn't help me solve it. I'm using the synchronous method for the FSCTL_MOVE_FILE call, but apparently the file system can still be updating stuff in the background. Hmm.
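
The re-read that returns the bad data is just this (a sketch; the buffer sizing assumes the common 1k file record, which FSCTL_GET_NTFS_VOLUME_DATA's BytesPerFileRecordSegment reports for real volumes):

// Re-read one Mft record by file reference number. 'out' must be large
// enough for the fixed output header plus a whole file record, e.g.
// sizeof(NTFS_FILE_RECORD_OUTPUT_BUFFER) - 1 + 1024.
BOOL ReadMftRecord(HANDLE hVolume, LONGLONG frn, BYTE *out, DWORD outSize)
{
    NTFS_FILE_RECORD_INPUT_BUFFER in;
    in.FileReferenceNumber.QuadPart = frn;

    DWORD bytes = 0;
    return DeviceIoControl(hVolume, FSCTL_GET_NTFS_FILE_RECORD,
                           &in, sizeof(in), out, outSize, &bytes, NULL);
}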

In a normal file, I'd be thinking LockFileEx with a shared lock (since I'm just reading). But I'm not sure that has any meaning for volume handles? And I'm even less sure Ntfs uses this mechanism internally to ensure consistency.

Still, it seems like a place to start. But my LockFileEx call against the volume handle is returning ERROR_INVALID_PARAMETER. I'm not seeing which parameter might be in error, unless it's the volume handle itself. Perhaps volume handles just don't support locks? Or maybe there are special flags I'm supposed to set in CreateFile when opening the volume handle? I've tried enabling the SE_BACKUP_NAME privilege and passing FILE_FLAG_BACKUP_SEMANTICS, but the error remains unchanged.

Moving forward, I can see a few alternatives here:

  1. Figure out how to lock sections using a volume handle (and hope the Ntfs driver is doing the same). Seems dubious at this point.
  2. Figure out how to flush the metadata for the file I just moved (nb: FlushFileBuffers on the MOVE_FILE_DATA.FileHandle didn't help. Maybe flushing the volume handle? See the sketch after this list.).
  3. Is there some 'official' means for reading non-resident data that doesn't involve ReadFile against a volume handle? I didn't find one, but maybe I missed it.
  4. Wait "a bit" after moving data to let the driver complete updating everything. Yuck.
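
For alternative 2, flushing the volume's metadata would look something like this (a sketch; per the FlushFileBuffers documentation the handle needs GENERIC_WRITE, which is also why it requires administrator rights):

void FlushVolume(LPCWSTR volumePath)   // e.g. L"\\\\.\\j:"
{
    HANDLE hVolume = CreateFileW(volumePath,
                                 GENERIC_READ | GENERIC_WRITE,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE,
                                 NULL, OPEN_EXISTING, 0, NULL);
    if (hVolume == INVALID_HANDLE_VALUE)
        return;

    // Flushing a volume handle flushes the volume's metadata.
    FlushFileBuffers(hVolume);
    CloseHandle(hVolume);
}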

FWIW, here's some test code for doing LockFileEx against a volume handle. Note that you must be running as an administrator to open (and thus lock) volume handles. I'm using J:, since that's my flash drive. The offset of 50000 was picked arbitrarily; it just needs to be less than the size of the volume.

#include <windows.h>

void Lock()
{
    WCHAR path[] = L"\\\\.\\j:";

    // Opening a volume handle requires administrator rights.
    HANDLE hRootHandle = CreateFileW(path,
                             GENERIC_READ,
                             FILE_SHARE_READ | FILE_SHARE_WRITE,
                             NULL,
                             OPEN_EXISTING,
                             0,
                             NULL);
    if (hRootHandle == INVALID_HANDLE_VALUE)
        return;

    OVERLAPPED olap = {};   // zeroed; Offset/OffsetHigh give the lock position
    olap.Offset = 50000;

    // Shared lock (no LOCKFILE_EXCLUSIVE_LOCK) on 1k of data at offset
    // 50000, failing immediately rather than blocking.
    BOOL b = LockFileEx(hRootHandle, LOCKFILE_FAIL_IMMEDIATELY, 0, 1024, 0, &olap);
    DWORD err = b ? ERROR_SUCCESS : GetLastError();   // ERROR_INVALID_PARAMETER here

    CloseHandle(hRootHandle);
}

The code for seeing the bad data is rather involved; however, the problem is readily reproducible. When it fails, I end up trying to read variable-length $ATTRIBUTE_LIST entries that have a record length of 0, which results in an infinite loop, since it looks like I've never finished reading the entire buffer. I'm working around it by exiting when the length is zero, but I worry about finding "leftover garbage" in the buffer instead of nice clean zeros. Detecting that would be impossible, so I'm hoping for a better solution.
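
For reference, the walk that hangs looks roughly like this. The entry layout isn't in the SDK headers, so the struct and its field names are my own, following the usual NTFS documentation:

#pragma pack(push, 1)
struct AttrListEntry
{
    DWORD     AttributeTypeCode;    // 0x00
    WORD      RecordLength;         // 0x04  total size of this entry
    BYTE      NameLength;           // 0x06
    BYTE      NameOffset;           // 0x07
    ULONGLONG LowestVcn;            // 0x08
    ULONGLONG SegmentReference;     // 0x10  Mft segment holding the attribute
    WORD      Instance;             // 0x18
    // attribute name (if any) follows
};
#pragma pack(pop)

// Walk the entries in an $ATTRIBUTE_LIST buffer of 'length' bytes.
void WalkAttributeList(const BYTE *buffer, DWORD length)
{
    DWORD offset = 0;
    while (offset + sizeof(AttrListEntry) <= length)
    {
        const AttrListEntry *e = (const AttrListEntry *)(buffer + offset);
        if (e->RecordLength == 0)
            break;      // the workaround: bail out on a half-written entry
        // ... examine *e ...
        offset += e->RecordLength;
    }
}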

Not surprisingly, there isn't a lot of info out there on any of this. So if someone has some experience here, I could use some insight.


Edit 1:

More things that don't quite work:

  • Still no luck on LockFileEx.
  • I tried flushing the volume handle (as Paul suggested). And while this works, it more than doubles my execution time. And, strictly speaking, it still doesn't solve the problem. There's still no guarantee that Ntfs isn't going to change things some more between the FlushFileBuffers and FSCTL_GET_NTFS_FILE_RECORD / ReadFile.
  • I wondered about the 'RecordChanged' timestamp in the $STANDARD_INFORMATION attribute. However, it isn't being updated by these changes to the $ATTRIBUTE_LIST.
  • Fragmenting a file eventually causes an ATTRIBUTE_LIST to get added, and as fragmentation continues to increase, more DATA records get added to that list. When a DATA record gets added, the UpdateSequenceNumber (not the one that's part of the MFT_SEGMENT_REFERENCE, the other one) gets updated. Unfortunately, this update is performed as a sequence of steps, and apparently the ATTRIBUTE_LIST buffer 'length' gets updated before the 'UpdateSequenceNumber'. So checking whether the 'UpdateSequenceNumber' has changed doesn't help avoid reading (potentially) bad information.

My next best thought is to see if perhaps Ntfs always zeros the new bytes before updating the record length (or maybe whenever the record length shrinks?). If I can depend on the record length being zero (instead of whatever leftover data might occupy those bytes), I can pretend to call this fixed.


Solution

  • I think I've got it.

    To reiterate the goal:

    After using FSCTL_GET_NTFS_FILE_RECORD to read a record from the Mft, I kept finding that the ATTRIBUTE_LIST record was in an 'inconsistent state' such that the reported record length was greater than the actual amount of data in the record. Reading data beyond what had been written seemed risky, as I couldn't be sure whether what I read was valid or leftover garbage.

    To that end, I suggested 4 alternatives that I hoped would let me work around this.

    1. Using LockFileEx on the volume (which seemed like the best answer when I began) turns out to be a complete non-starter. RbMm & eryksun (as well as my own experimentation) provide some pretty compelling evidence this just won't work. As the 'File' in LockFileEx implies, this function only works on files.
    2. Flushing the volume handle makes the symptoms go away. But at a huge (> 100%) penalty in performance. It's also not clear whether the problem is actually solved, or merely hidden behind the slowdown this causes.
    3. The idea of 'some other' api to read non-resident data seems mythical.
    4. Waiting some (unspecified) amount of time after doing a FSCTL_MOVE_FILE is not a plan, it's a hope.

    For a brief time, it looked like checking the UpdateSequenceNumber in the NtfsRecord might provide a solution. However, the order of events Ntfs uses when updating a record means that the record length of the ATTRIBUTE_LIST gets updated (well) before the UpdateSequenceNumber.

    But then I began thinking about exactly when this might be a problem. If I ignore it, where will it fail?

    Currently I'm experiencing the problem as the ATTRIBUTE_LIST grows (since I'm deliberately and massively fragmenting a file). At that point, the problem is easily detectable due to the zero record length. I've run the program a number of times, and while it's just anecdotal, the extra space as the record grows has always been zeroed. This makes sense, as you'd zero out the entire buffer when you first allocate it. Both standard programming practice and observation support this conclusion.

    But what about when the record starts to shrink? Or shrinks and then grows? Could you end up with leftover data there instead of the (easily interpreted) zeros?

    Then it hit me: The ATTRIBUTE_LIST never shrinks. I was just complaining about this a few weeks ago. Even when you completely defragment the file and all these extra DATA records are no longer required, Ntfs doesn't compact them. And now for the first time I have a glimpse of why that might be. There's a possibility this might change in W10, but that might just be an overly optimistic interpretation of an undocumented function.

    So, I don't need to worry about reading garbage data (possibly including a meaningless record length causing me to overrun the buffer). The record length in the ATTRIBUTE_LIST can be trusted. There is just the possibility that the last record might have a zero record length.

    I can either ignore the zero length record (essentially returning the pre-growth information) or reread the record until the UpdateSequenceNumber changes (indicating that the update is complete).
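
    In case it helps anyone, this is the shape of the reread loop (a sketch; UsnOfRecord follows the multi-sector header layout at the top of every file record, and ReadMftRecord is the helper sketched in the question):

    // The record's update sequence number: the header stores the offset of
    // the update sequence array at +4, and the array's first WORD is the USN.
    WORD UsnOfRecord(const BYTE *record)
    {
        WORD usaOffset = *(const WORD *)(record + 4);
        return *(const WORD *)(record + usaOffset);
    }

    // Reread until the USN moves past 'staleUsn' (bounded, so a record
    // that never changes can't spin forever).
    BOOL RereadUntilUpdated(HANDLE hVolume, LONGLONG frn, WORD staleUsn,
                            BYTE *out, DWORD outSize, int maxTries)
    {
        for (int i = 0; i < maxTries; ++i)
        {
            if (!ReadMftRecord(hVolume, frn, out, outSize))
                return FALSE;
            const NTFS_FILE_RECORD_OUTPUT_BUFFER *rec =
                (const NTFS_FILE_RECORD_OUTPUT_BUFFER *)out;
            if (UsnOfRecord(rec->FileRecordBuffer) != staleUsn)
                return TRUE;
            Sleep(0);   // yield and try again
        }
        return FALSE;
    }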

    Tada.