Monday, March 31, 2025

ZFS Spooky Failure at a Distance

I use Proxmox with a ZFS array to run a number of self-hosted services. I have been working on setting up zrepl for offsite backup, replicating encrypted ZFS datasets that the remote system can only store, never decrypt.
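
The property that makes this work is ZFS raw sends: an encrypted dataset can be replicated with `zfs send -w` and the receiver stores the blocks exactly as sent, without ever seeing the key. A minimal sketch of the mechanism zrepl builds on (dataset and host names here are hypothetical, not my actual layout):

# Create an encrypted dataset on the source; the key never leaves this machine
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/services

# Snapshot it and replicate the raw, still-encrypted stream to the remote pool
zfs snapshot tank/services@offsite-2025-03-31
zfs send -w tank/services@offsite-2025-03-31 | ssh zfsremote zfs receive pool1/services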

While I was working through all of this, the new 28TB disk intended for the remote system appeared to have failed.

root@zfsremote:~# zpool status
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       DEGRADED     0     0     0
          sdb       DEGRADED     0    35     0  too many errors
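
Write errors can come from the drive itself or from anything in between, so the obvious first check is whether the disk's own SMART counters agree that something is wrong. Assuming smartmontools is installed and the device really is /dev/sdb, something like:

# Overall SMART health verdict from the drive
smartctl -H /dev/sdb

# The attributes that usually move when a drive is genuinely dying
smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect'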

Indeed, there are kernel messages about disk errors:

Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368396833 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368399137 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368397089 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368401697 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368401441 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368399393 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368402721 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368402465 ...
Mar 31 07:16:33 zfsremote kernel: I/O error, dev sdb, sector 23368402209 ...
Mar 31 07:16:34 zfsremote kernel: I/O error, dev sdb, sector 23368401953 ...
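
For the record, these came out of the kernel log on the remote machine; something like this surfaces them, assuming systemd-journald is in use:

# Kernel messages since this morning, filtered to the suspect device
journalctl -k --since "07:00" | grep 'dev sdb'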

It seemed odd, though. I had run destructive `badblocks` tests on this disk for weeks before moving on to creating the ZFS pool. After all that, would it really pick this moment to start failing uncorrectably?
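
For reference, that burn-in was a destructive write-mode badblocks pass, something along these lines (illustrative; the block size mainly matters so that badblocks can address a disk this large):

# Destructive four-pattern write/read test; wipes the entire disk
badblocks -wsv -b 8192 /dev/sdb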

Quite suspiciously, 07:16:33 is also the very instant when I sent a kill signal to a vzdump process running on the Proxmox host.

116: 2025-03-31 07:14:31 INFO:  29% (7.4 TiB of 25.5 TiB) in 9h 37m 2s
116: 2025-03-31 07:16:33 ERROR: interrupted by signal
116: 2025-03-31 07:16:33 INFO: aborting backup job

As I now know, killing vzdump with a signal is not the way to stop it; `vzdump -stop` is the proper way to interrupt a running backup.
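
On the Proxmox host that looks like this, per the vzdump man page, which describes the flag as stopping running backup jobs on the host:

# Abort running backup jobs on this host instead of signalling the process
vzdump -stop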

The OpenZFS docs say: "the following cases will all produce errors that do not indicate potential device failure: 1) A network attached device lost connectivity but has now recovered"

So far as I can tell, that is the explanation for this failure. Sending a signal to vzdump interrupted the stream of ZFS operations, which manifested as a failed array on the other end. I've cleared the errors with `zpool clear` and am hoping zrepl will sort out bringing the two ZFS filesystems back into sync.
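
Concretely, the cleanup was along these lines; the scrub is an extra precaution on my part rather than something the error message demands:

# Reset the error counters on the affected device
zpool clear pool1 sdb

# Re-verify every block's checksum after the interruption, then check the result
zpool scrub pool1
zpool status pool1

# Live view of zrepl catching the remote side back up
zrepl status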

I plan to give it a day, then restore the remote dataset and check whether the file contents are sensible. The remote system does not, and never will, have the encryption key, so it cannot verify the contents of the datasets it holds; I'll have to transfer them back to check them myself.
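
That check will be a raw send back in the other direction, with the key only ever loaded on the local side. Roughly (names hypothetical, as above):

# On the remote: raw-send the latest snapshot back to a scratch dataset locally
zfs send -w pool1/services@offsite-2025-03-31 | ssh proxmox-host zfs receive tank/restore-test

# On the local host: load the key, mount, and eyeball the files
zfs load-key tank/restore-test
zfs mount tank/restore-test
ls /tank/restore-test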