I use Proxmox with a ZFS array to run a number of self-hosted services. I have been working on setting up zrepl for offsite backup, replicating encrypted ZFS datasets that the remote system can store but never decrypt.
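For context, zrepl manages this on top of ZFS "raw" sends, which ship the encrypted blocks verbatim. Done by hand, with made-up dataset names (`tank/services` for the local side; `pool1` is the pool on the remote box), the mechanism looks roughly like this:

```
# Hypothetical names: "tank/services" on the Proxmox host, "pool1" on the remote.
# The dataset is encrypted locally and the key never leaves this machine.
zfs create -o encryption=on -o keyformat=passphrase tank/services

# A raw send (--raw / -w) transfers the ciphertext as-is, so the receiver
# stores blocks it has no key for.
zfs snapshot tank/services@manual-test
zfs send --raw tank/services@manual-test | ssh zfsremote zfs recv pool1/services
```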

While I was working through all of this, the new 28 TB disk I'd bought for the remote system appeared to fail.
```
root@zfsremote:~# zpool status
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       DEGRADED     0     0     0
          sdb       DEGRADED     0    35     0  too many errors
```
Indeed, there are kernel messages about disk errors:
```
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368396833 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368399137 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368397089 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368401697 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368401441 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368399393 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368402721 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368402465 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368402209 ...
Mar 31 07:16:34 zfsremote kernel: I/O error, dev sdb, sector 23368401953 ...
```
It seems odd, though. I had run destructive `badblocks` tests on this disk for weeks before moving on to creating the ZFS pool. After all that, would it really pick this moment to start failing uncorrectably?
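For reference, a destructive `badblocks` run of the kind I mean looks something like the sketch below (it wipes the disk, so only for a drive with nothing on it):

```
# Destructive write-and-verify pass over the whole disk (wipes everything).
# A larger block size (-b) may be needed on a disk this big so the block
# count stays within badblocks' 32-bit limit.
badblocks -wsv -b 8192 /dev/sdb
```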
Quite suspiciously, 07:16:33 is also the very instant when I sent a kill signal to a vzdump process running on the Proxmox host.
```
116: 2025-03-31 07:14:31 INFO: 29% (7.4 TiB of 25.5 TiB) in 9h 37m 2s
116: 2025-03-31 07:16:33 ERROR: interrupted by signal
116: 2025-03-31 07:16:33 INFO: aborting backup job
```
As I now know, killing vzdump with a signal is not the right way to interrupt it; `vzdump -stop` is.
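In other words, the clean interruption would have been:

```
# Ask Proxmox to stop running backup jobs on this host, instead of
# signalling the vzdump process directly.
vzdump -stop
```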
The OpenZFS docs say:

> the following cases will all produce errors that do not indicate potential device failure: 1) A network attached device lost connectivity but has now recovered
So far as I can tell, that is the explanation for this failure: my sending a signal to vzdump interrupted the stream of ZFS operations mid-flight, which showed up as a degraded pool on the other end. I've cleared the errors with `zpool clear` and am hoping zrepl will sort out bringing the two sides back into sync.
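Concretely, the cleanup on the remote end amounts to something like this; the scrub is my own extra paranoia, and it works even without the key because it only verifies block checksums:

```
# Reset the error counters for the pool (or just the one device).
zpool clear pool1 sdb

# Optionally re-verify everything already stored; a scrub checks block
# checksums and does not need the encryption key.
zpool scrub pool1
zpool status pool1
```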
I plan to give it a day, then restore the remote dataset and check that the file contents are sensible. The remote system does not, and never will, have the encryption key, so it cannot verify the contents of the datasets it holds; I'll have to transfer them back to access them.
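That check will presumably look something like this, with made-up dataset and snapshot names; the key only ever gets loaded on the local side:

```
# Pull a snapshot back from the remote (still ciphertext in transit)...
ssh zfsremote zfs send --raw pool1/services@zrepl_20250401_000000 \
    | zfs recv tank/restore-check

# ...then load the key locally and mount the dataset to inspect the files.
zfs load-key tank/restore-check
zfs mount tank/restore-check
ls /tank/restore-check
```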