How to retrieve the underlying block device I/O error

Consider a device in the system, something under /dev/hdd[sg][nvme]xx. Open the device, get the file descriptor, and start working with it (read(v)/write(v)/lseek, etc.). At some point you may get EIO. How do you retrieve the underlying error reported by the device driver?

EDIT001: If this is impossible using the unistd functions, maybe there are other ways to work with block devices that can provide more low-level information, like sg_scsi_sense_hdr?

Solution

You can't get any more error detail out of the POSIX functions: all they give you is errno, and EIO is as specific as it gets. You're on the right track with the SCSI generic stuff, though. But, boy, it's loaded with hair. Check out the example in sg3_utils of how to do a SCSI READ(16). This will let you look at the sense data when it comes back:

https://github.com/hreinecke/sg3_utils/blob/master/examples/sg_simple16.c

Of course, this technique doesn't work with NVMe drives (at least, not to my knowledge).

One concept I've played with in the past is to use normal POSIX/libc block I/O functions like pread and pwrite until I get an EIO out. At that point, you can bring in the SCSI-generic versions to try to figure out what happened. In the ideal case, a pread or lseek/read fails with EIO. You then turn around and re-issue it using a SG READ (10) or (16). If it's not just a transient failure, this may return sense data that your application can use.
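A minimal sketch of that fallback, assuming Linux and the SG_IO ioctl from <scsi/sg.h>. The function names build_read10_cdb and retry_read_with_sense are hypothetical, the 20-second timeout is an arbitrary choice, and the caller is assumed to pass a buffer of nblocks times the device's logical block size. Depending on the system, the ioctl may need elevated privileges. This is not the sg3_utils code, just an illustration of the idea:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Fill a 10-byte READ(10) CDB: opcode 0x28, big-endian LBA and block count. */
void build_read10_cdb(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    assert(cdb != NULL);
    memset(cdb, 0, 10);
    cdb[0] = 0x28;                     /* READ(10) opcode */
    cdb[2] = (uint8_t)(lba >> 24);     /* LBA, most significant byte first */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);  /* transfer length in blocks */
    cdb[8] = (uint8_t)nblocks;
}

/*
 * Re-issue a read that failed with EIO, this time through SG_IO.
 * Returns -1 if the ioctl itself failed (errno is set), otherwise the
 * number of sense bytes the device returned (0 means no sense data).
 * buf_len must equal nblocks * logical block size (caller's job).
 */
int retry_read_with_sense(int fd, uint32_t lba, uint16_t nblocks,
                          void *buf, unsigned int buf_len,
                          uint8_t *sense, uint8_t sense_len)
{
    uint8_t cdb[10];
    struct sg_io_hdr hdr;

    build_read10_cdb(cdb, lba, nblocks);

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';                   /* required by the sg interface */
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;  /* data flows device -> host */
    hdr.cmdp = cdb;
    hdr.cmd_len = sizeof(cdb);
    hdr.dxferp = buf;
    hdr.dxfer_len = buf_len;
    hdr.sbp = sense;                          /* sense data lands here */
    hdr.mx_sb_len = sense_len;
    hdr.timeout = 20000;                      /* milliseconds; arbitrary */

    if (ioctl(fd, SG_IO, &hdr) < 0)
        return -1;
    return hdr.sb_len_wr;
}
```

After a pread on fd fails with EIO, you would call retry_read_with_sense with the same LBA range and a sense buffer of, say, 32 bytes. A positive return value means the sense buffer holds something worth decoding.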

Here's an example, using the command-line sg_read program. I have an iSCSI-attached disk that I'm reading and writing. On the target, I remove its LUN mapping. dd reports only EIO:

# dd if=/dev/sdb of=/tmp/output bs=512 count=1 iflag=direct
dd: error reading ‘/dev/sdb’: Input/output error

but sg_read reports some more useful information:

[root@localhost src]# sg_read blk_sgio=1 bs=512 cdbsz=10 count=512 if=/dev/sdb odir=1 verbose=10
Opened /dev/sdb for SG_IO with flags=0x4002
    read cdb: 28 00 00 00 00 00 00 00 80 00
      duration=9 ms
reading: SCSI status: Check Condition
 Fixed format, current;  Sense key: Illegal Request
 Additional sense: Logical unit not supported
 Raw sense data (in hex):
        70 00 05 00 00 00 00 0a  00 00 00 00 25 00 00 00
        00 00
sg_read: SCSI READ failed
Some error occurred,  remaining block count=512
0+0 records in

You can see the Logical unit not supported additional sense code in the above output, indicating that there's no such LU at the target.
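To pull the same fields out of a raw sense buffer in your own program, something like the following would do. It is a sketch: decode_sense and struct sense_info are names made up for illustration, and it only extracts the sense key, ASC, and ASCQ from fixed-format (response codes 0x70/0x71) and descriptor-format (0x72/0x73) sense data, ignoring everything else in the buffer:

```c
#include <stddef.h>
#include <stdint.h>

/* The three fields most applications care about. */
struct sense_info {
    uint8_t key;   /* sense key, e.g. 0x05 = Illegal Request */
    uint8_t asc;   /* additional sense code */
    uint8_t ascq;  /* additional sense code qualifier */
};

/*
 * Decode sense key / ASC / ASCQ from a raw sense buffer.
 * Returns 0 on success, -1 if the buffer is too short or the
 * response code is not one of the four standard formats.
 */
int decode_sense(const uint8_t *s, size_t len, struct sense_info *out)
{
    if (s == NULL || out == NULL || len < 1)
        return -1;

    switch (s[0] & 0x7f) {             /* bit 7 is the VALID flag, mask it */
    case 0x70:
    case 0x71:                         /* fixed format */
        if (len < 14)
            return -1;
        out->key = s[2] & 0x0f;
        out->asc = s[12];
        out->ascq = s[13];
        return 0;
    case 0x72:
    case 0x73:                         /* descriptor format */
        if (len < 4)
            return -1;
        out->key = s[1] & 0x0f;
        out->asc = s[2];
        out->ascq = s[3];
        return 0;
    default:
        return -1;
    }
}
```

Fed the 18 raw bytes from the sg_read output above (70 00 05 ... 25 00 ...), this yields key 0x05, ASC 0x25, ASCQ 0x00, which is the Illegal Request / Logical unit not supported pair that sg_read printed.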

Possible? Yes. But as you can see from the code in sg_simple16.c, it's not easy!