NAS Outage Postmortem
On 2024-12-06, around 18:45 CET, my NAS died. It's an HP MicroServer N40L from 2012.
I was out at the time and only noticed the incident when I got home, a bit after 23:00.
Duration of the incident: 6 days!!
Impact
Impact was pretty severe:
- backup service down;
- cloud down;
- no access to calendars and contacts from any device;
- files were less of an issue because I don't access them often.
Root cause
The PSU (power supply unit) died. This was confirmed after I replaced the PSU¹.
Symptoms: plugging in the power cable didn't produce the usual distinct "crack", and pressing the power button (which had always worked like a charm) didn't power the machine on.
Forums also suggest that the PSUs in these models tend to fail often.
Remediation
- Ordered a new PSU. I went for the same model (FSP270-60LE). Finding replacement parts didn't feel easy and I hope manufacturers improve in that area (right-to-repair laws helping): I wanted quality, a decent price, and prompt delivery, but mostly found used items or resellers from far abroad or with terrible reviews. Some forums also suggest a PicoPSU coupled with an AC adapter, but I wanted to try what appeared to be the cleaner solution first. With the best compromise I found, the replacement part was expected in about a week 😒 but actually arrived after 10 days. 😭
- Tried to restore the cloud data. At least I should have been able to restore the cloud data locally as a temporary solution. Restoring 140 GB from the remote backup site takes around 30 hours… And of course I forgot the `-s` flag for `zfs recv`, which saves the state of interrupted streams so they can be resumed, so the first attempt was lost. 😭 In fact, with all the network interruptions and manual resumes, this approach turned out to be less viable than hoped and slowly eroded my confidence in being able to restore in a reasonable amount of time… 🥵 (A sketch of the resumable transfer follows this list.)
  Things I also considered:

  - turning my Nextcloud instance into a CalDAV-only one, i.e. disabling Files, but that's apparently not possible²;
  - plugging one of the NAS disks into a server; but apart from being mirrored, the disks are also striped (striped mirrors, not plain RAID 0), so a single drive isn't enough, and plugging in two drives is trickier because "servers" would most probably lack the SATA or power connectors.
- Did nothing for backups. 🤞 My bet was that I could survive without any safety net for a week, but I'd set something up if recovery took longer.
- Looked for a potential NAS replacement. Again nothing stood out: no decent, cheap, 4-bay, FreeBSD-compatible NAS. Maybe not a rack server in my living room? 🤔 How about a cheap tower or desktop with an HBA card?

  Well, this used PowerEdge T340 popped up after a few days of research. Fully equipped with drives (4×4TB + 2×2TB), redundant PSUs, a RAID controller, iDRAC, 32 GB of ECC RAM and… silent Noctua fans! All for a very attractive price. I thought it might be a bit overkill for my intended usage (i.e. a simple ZFS NAS), and the estimated power consumption would at least double. Then I figured I could actually host compute loads on it as well, especially if it's that robust, and thus get rid of other servers. But eventually someone else was less hesitant and got it first. 😭
So I went back to the market and found another N40L for 40€ (with disks) and just bought it. Back home I plugged the old disks into the new MicroServer and all services were back on. 🎉
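For the record, here's roughly what the resumable restore should have looked like: a minimal sketch with made-up names (local pool `tank`, dataset `tank/cloud`, a snapshot `@snap`, and the remote backup site reachable over SSH as `backup`).

```sh
# Receive with -s so that an interrupted stream leaves a resumable
# state on the local dataset instead of being thrown away.
ssh backup zfs send tank/cloud@snap | zfs recv -s tank/cloud

# After an interruption, read back the token saved by -s...
TOKEN=$(zfs get -H -o value receive_resume_token tank/cloud)

# ...and hand it to the sending side, which resumes the stream
# from where it stopped.
ssh backup zfs send -t "$TOKEN" | zfs recv -s tank/cloud
```

Automating the resume then boils down to a retry loop around the last two commands.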
Learnings
- Spares or no spares? I was completely caught by surprise, as I was somehow under the impression that the NAS was this invincible bedrock of my tiny infrastructure. Maybe because of the storage redundancy, but also because, as long as I keep my data safe somewhere, I can just replace containers and servers easily.

  Yes, I had to replace the NAS OS SSD recently, but I never thought about the PSU failing! While I do have spares for "servers" (refurbished laptops and desktops), I didn't think of any for the NAS. Nor for the router, by the way.

  As I can't have spares for everything, I guess the best strategy is to have enough to restore services in degraded mode while searching for the best replacement at hand. I could also be more proactive, like going for the cheap solution right away, which would buy me time to work out a better solution in peace later.

  There are diverging opinions on Reddit's r/homelab about redundancy, depending on how serious people are about self-hosting and what they host.

  Anyway, now I will have a spare for this critical piece of infrastructure that is my NAS.
- Local backups. Someone on a forum was kind enough to mention the "3-2-1" backup rule: 3 copies, 2 different media, 1 copy off-site. The "3 copies" part is to be understood as: original data, on-site snapshot, off-site snapshot.

  The remote data restore never succeeded. 💀 Even after automating the resumption of interrupted transfers, and after more than 2 days of trying, the remote server went offline.

  I don't have a definitive conclusion for this one. Yes, I will back up cloud data locally (as long as it fits in other storage areas) for such disaster scenarios, and because it's relatively easy. But having a NAS with mirrored disks and a spare NAS is probably a sufficient guarantee. I might also explore waking up the spare NAS to actually back up the primary in the future³ (see the sketch after this list)…

  I guess as long as I have a spare NAS to plug disks into, I'm reasonably safe. I will think of remote backups as the last-resort / end-of-times solution.
- Bad timing. It really was bad timing, and I don't think there's much I can do about that in general:
  - Wrong weekend, as it was a family member's birthday and I had to prepare the party.
  - Wrong planning, as I was about to upgrade FreeBSD to 14.2 along with my email and jail setups, so my infrastructure-as-code is currently a bit out of sync.
  - Wrong season, as we're approaching the Christmas holidays and I have to leave home for a while to visit family.
- Mobile phone sync. Somehow my phone hadn't kept any local copy of my calendars, and many contacts were missing. I'm not sure whether it's DAVx⁵ (my sync app), Etar (my calendar app), or the stock contacts app that isn't offline-ready.

  That was enough of a pain that I bought a paper agenda. I also gave org-mode a try for managing meetings. Maybe this is actually the beginning of a journey towards less dependency on my phone: paper agenda and wrist watch, yeah! 😄
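Following up on note 3 below, here's a minimal sketch of what "waking up the spare NAS to back up the primary" could look like: wake the spare over the network, replicate the day's delta, and power it back off. Everything here is a placeholder or an assumption: the host `spare`, the interface and MAC address, the target pool `backup`, FreeBSD's wake(8) as the wake-on-LAN tool, and yesterday's snapshot already existing on both sides.

```sh
#!/bin/sh
# Wake the spare NAS (placeholder interface and MAC), let it boot.
wake em0 00:11:22:33:44:55
sleep 120

# Take a date-stamped snapshot of the primary's cloud dataset.
TODAY=$(date +%Y-%m-%d)
YESTERDAY=$(date -v-1d +%Y-%m-%d)   # FreeBSD date(1) syntax
zfs snapshot tank/cloud@backup-$TODAY

# Send only the delta since yesterday's snapshot, then power
# the spare back off until tomorrow.
zfs send -i tank/cloud@backup-$YESTERDAY tank/cloud@backup-$TODAY \
  | ssh spare zfs recv -F backup/cloud
ssh spare shutdown -p now
```

Run daily from cron on the primary, that would give me the on-site snapshot leg of 3-2-1 without keeping a second machine powered on all the time.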
Links
1. In retrospect I should have tested the PSU with a multimeter. It's not hard, and I would have been 100% sure from the start. That said, it would probably not have changed the decisions that followed, as (1) I was already pretty confident the PSU was the culprit, and (2) the motherboard could also have been hit, and that's probably harder to check without a working PSU anyway.
2. I also learned later that the Nextcloud CLI doesn't allow exporting DAV data from the database.
3. Here's a good read about mirror vdevs that reminds you to back up your pool, no excuses. Which I do: locally only the critical parts, remotely everything.