c++ c linux ioctl block-device

How to retrieve underlying block device IO error


Consider a device in the system, something under /dev/hdd[sg][nvme]xx. Open the device, get the file descriptor, and start working with it (read(v)/write(v)/lseek, etc.). At some point you may get EIO. How do you retrieve the underlying error reported by the device driver?

EDIT001: If this is impossible with the unistd functions, are there other ways to work with block devices that provide lower-level information, such as sg_scsi_sense_hdr?


Solution

  • You can't get any more error detail out of the POSIX functions. You're on the right track with the SCSI generic stuff, though. But, boy, it's loaded with hair. Check out the example in sg3_utils of how to do a SCSI READ(16). It lets you look at the sense data when it comes back:

    https://github.com/hreinecke/sg3_utils/blob/master/examples/sg_simple16.c

    Of course, this technique doesn't work with NVMe drives (at least, not to my knowledge).

    One approach I've played with in the past is to use the normal POSIX/libc block I/O functions, like pread and pwrite, until one of them fails with EIO, and only then bring in the SCSI generic versions to try to figure out what happened. In the ideal case, a pread or lseek/read fails with EIO, and you turn around and re-issue the same request as an SG_IO READ(10) or READ(16). If it's not just a transient failure, the re-issued command may return sense data that your application can use.
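    The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the sg3_utils implementation: the function names (build_read10_cdb, sg_read10, pread_with_sense), the 20-second timeout, and the caller-supplied block size are all assumptions made for the example. It also assumes the device accepts the SG_IO ioctl on its block node and that the LBA fits in 32 bits (READ(16) would be needed otherwise).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <scsi/sg.h>

/* Fill a 10-byte READ(10) CDB: opcode 0x28, big-endian LBA and block count. */
void build_read10_cdb(uint8_t cdb[10], uint32_t lba, uint16_t blocks)
{
    memset(cdb, 0, 10);
    cdb[0] = 0x28;                   /* READ(10) opcode */
    cdb[2] = (uint8_t)(lba >> 24);   /* logical block address, bytes 2..5 */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(blocks >> 8); /* transfer length, bytes 7..8 */
    cdb[8] = (uint8_t)blocks;
}

/* Issue READ(10) through SG_IO.
 * Returns the number of sense bytes the kernel wrote into `sense`
 * (0 means no sense data), or -1 if the ioctl itself failed. */
int sg_read10(int fd, uint32_t lba, uint16_t blocks,
              void *buf, unsigned int buf_len,
              uint8_t *sense, uint8_t sense_len)
{
    uint8_t cdb[10];
    struct sg_io_hdr hdr;

    build_read10_cdb(cdb, lba, blocks);

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';                  /* required by the sg v3 interface */
    hdr.dxfer_direction = SG_DXFER_FROM_DEV; /* data flows device -> host */
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;
    hdr.dxfer_len = buf_len;
    hdr.dxferp = buf;
    hdr.mx_sb_len = sense_len;               /* room available for sense data */
    hdr.sbp = sense;
    hdr.timeout = 20000;                     /* milliseconds; arbitrary choice */

    if (ioctl(fd, SG_IO, &hdr) < 0)
        return -1;
    return hdr.sb_len_wr;
}

/* Normal pread first; only if it fails with EIO, re-issue the same read
 * via SG_IO to collect sense data. *sense_written receives the sense length. */
ssize_t pread_with_sense(int fd, void *buf, uint16_t blocks, uint32_t lba,
                         uint32_t block_size,
                         uint8_t *sense, uint8_t sense_len, int *sense_written)
{
    ssize_t n;

    *sense_written = 0;
    n = pread(fd, buf, (size_t)blocks * block_size, (off_t)lba * block_size);
    if (n < 0 && errno == EIO) {
        int w = sg_read10(fd, lba, blocks, buf,
                          (unsigned int)blocks * block_size, sense, sense_len);
        if (w > 0)
            *sense_written = w;
        errno = EIO; /* keep the original error visible to the caller */
    }
    return n;
}
```

    A caller would pass a sense buffer of 32 bytes or so and, when pread_with_sense returns -1 with a non-zero *sense_written, hand those bytes to whatever sense decoding it uses. If the failure was transient, the re-issued read may simply succeed and no sense data comes back.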

    Here's an example using the command-line sg_read program. I have an iSCSI-attached disk that I'm reading and writing. On the target, I remove its LUN mapping. dd reports EIO:

    # dd if=/dev/sdb of=/tmp/output bs=512 count=1 iflag=direct
    dd: error reading ‘/dev/sdb’: Input/output error
    

    but sg_read reports some more useful information:

    [root@localhost src]# sg_read blk_sgio=1 bs=512 cdbsz=10 count=512 if=/dev/sdb odir=1 verbose=10
    Opened /dev/sdb for SG_IO with flags=0x4002
        read cdb: 28 00 00 00 00 00 00 00 80 00
          duration=9 ms
    reading: SCSI status: Check Condition
     Fixed format, current;  Sense key: Illegal Request
     Additional sense: Logical unit not supported
     Raw sense data (in hex):
            70 00 05 00 00 00 00 0a  00 00 00 00 25 00 00 00
            00 00
    sg_read: SCSI READ failed
    Some error occurred,  remaining block count=512
    0+0 records in
    

    You can see the "Logical unit not supported" additional sense code in the above output (sense key 0x05, additional sense code 0x25 in the raw bytes), indicating that there's no such LU at the target.
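    To pull the same fields out of a raw sense buffer in your own program, something like the sketch below should do. The struct and function names (sense_info, decode_sense) are made up for this example. The byte offsets follow the fixed-format (response codes 0x70/0x71) and descriptor-format (0x72/0x73) sense layouts. Mapping the ASC/ASCQ pair to a message string is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* The three fields most applications care about, plus the format code. */
struct sense_info {
    uint8_t response_code; /* 0x70/0x71 fixed, 0x72/0x73 descriptor */
    uint8_t sense_key;     /* e.g. 0x05 = Illegal Request */
    uint8_t asc;           /* additional sense code */
    uint8_t ascq;          /* additional sense code qualifier */
};

/* Decode a raw sense buffer.
 * Returns 0 on success, -1 if the buffer is too short or the
 * response code is not one of the four standard formats. */
int decode_sense(const uint8_t *sense, size_t len, struct sense_info *out)
{
    uint8_t rc;

    if (len < 1)
        return -1;
    rc = sense[0] & 0x7f; /* top bit is the VALID flag, not part of the code */

    switch (rc) {
    case 0x70: /* fixed format, current error */
    case 0x71: /* fixed format, deferred error */
        if (len < 14)
            return -1;
        out->sense_key = sense[2] & 0x0f;
        out->asc = sense[12];
        out->ascq = sense[13];
        break;
    case 0x72: /* descriptor format, current error */
    case 0x73: /* descriptor format, deferred error */
        if (len < 4)
            return -1;
        out->sense_key = sense[1] & 0x0f;
        out->asc = sense[2];
        out->ascq = sense[3];
        break;
    default:
        return -1;
    }
    out->response_code = rc;
    return 0;
}
```

    Fed the 18 raw bytes from the sg_read output above, this yields response code 0x70, sense key 0x05, ASC 0x25, and ASCQ 0x00, which is the "Illegal Request / Logical unit not supported" pair that sg_read printed.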

    Possible? Yes. But as you can see from the code in sg_simple16.c, it's not easy!