At the end of last year, I made a post talking about how I don’t need no stinkin’ HA in my homelab. That post is still largely true, but I did make some changes and re-implement a High Availability cluster within Proxmox, without adding another heavy server node back into the cluster.
Eating My Words About HA
In that post I talked a bit about how my workloads didn’t really need the full-blown live migratable, zero downtime setup I had before. For the most part, that’s all still true: all but about three VMs can be stopped for hours with it only being slightly annoying. The three VMs that are left, though, straddle the line between annoying and quite inconvenient when down. In truth, I didn’t really think of those machines when making that change (they were brand new at the time).
Specifically, these machines are a Zentyal domain controller, my monitoring machines, and the machine that does all the heavy lifting for my scanners. The desktops in my house are domain joined, so an outage there breaks quite a bit. Monitoring is obviously important to tell me these machines are offline. The scanners do tend to have a few dozen people listening to streams during the day, so I try to keep them online as much as possible.
What showed me the weakness of my setup was my two-year-old Chaos Monkey, who found his way into the room where my servers are running, ran up to the node that runs these important jobs, and pushed the power button to trigger a shutdown in Proxmox. I didn’t catch that he did that right away, and since my monitoring is on that node I didn’t pick up that it was offline for about an hour.
That’s one of the use cases of HA, though: offering a path for machines to restart in the event of human error.
A Two-Node Setup
My Proxmox Cluster runs two nodes, my main node that is power hungry and my important node that will keep these important services running even if the power goes out. I use ZFS to replicate the VMs between nodes, so they can migrate without much issue.
Proxmox (and most clusters) really prefer you run three nodes because of an idea called quorum. The idea is that over 50% of the nodes need to be online and talking to each other, which lets the cluster identify cases where nodes have become isolated. The important part is that three is the usual minimum: a two node cluster would need both nodes online, since one node is only 50% and quorum needs over 50%.
To work around this, Corosync (which is what does the important election stuff in Proxmox) can use a voter node or qdevice to break this tie. You still have 2 compute nodes, but some external device is counted in the node count. You can have a healthy, quorate cluster with both compute nodes, or one compute node and this voter node. The Proxmox docs talk about this more.
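The quorum arithmetic above is easy to sanity check with a few lines of shell. This is only a sketch of the majority rule as described here (strictly more than half of all votes); the function names votes_needed and is_quorate are made up for illustration and are not part of Proxmox or Corosync.

```shell
#!/bin/bash
# Votes needed for quorum: a strict majority of all expected votes.
votes_needed() {
    local total_votes=$1
    echo $(( total_votes / 2 + 1 ))
}

# is_quorate TOTAL_VOTES ONLINE_VOTES
# Exit status 0 when the online votes form a majority, 1 otherwise.
is_quorate() {
    local total_votes=$1 online_votes=$2
    [ "$online_votes" -ge "$(votes_needed "$total_votes")" ]
}

# Two compute nodes, no qdevice: one surviving node is only 50%.
is_quorate 2 1 && echo "2 votes, 1 online: quorate" || echo "2 votes, 1 online: NOT quorate"

# Two compute nodes plus a qdevice (3 votes): one node plus the voter is enough.
is_quorate 3 2 && echo "3 votes, 2 online: quorate" || echo "3 votes, 2 online: NOT quorate"
```

With two votes, both are needed; adding the third vote from the qdevice is what lets a single compute node keep running on its own.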
The kicker in my case goes back to my lower power core that keeps running in an outage. I need to maintain cluster quorum when the rest of the network shuts down and ideally put the voter node where it is independent of the Proxmox Cluster. Keeping alive in a power outage ruled out any external server or machine elsewhere in the house, and keeping it independent of Proxmox ruled out my Raspberry Pis since their root disk is hosted on Proxmox via iSCSI. Adding more full blown Proxmox nodes is out of the question.
What I settled on was the UDM Pro in my rack, an ARM64 router running Debian Linux. Since the voter node is especially lightweight, there’s no real performance hit on the router (at least compared to if I ran Protect or some other Unifi application on it).
A word about HA and ZFS
One thing that I’ve been burned by before (which is why I avoided this setup for years) is using HA with ZFS. If you use ZFS replication, Proxmox will copy the machine’s storage to other nodes on a fixed schedule, and enable HA failover. If a node goes offline, the most recent successful sync is used to reboot the machine, which means three things:
- You’re going to lose the data that changed since your last sync. If you sync every hour, it could be up to an hour if everything works.
- If the sync hasn’t been successful for some time, it will lose the data since the last sync. If that was a day, you’ll lose a day. If it hasn’t been synced for a week, you’ll lose a week.
- Having frequent snapshot backups is crucial for when the ZFS replication loses things.
The key change here compared to my setup in the past is I’m taking full snapshot backups of these machines every two hours now. Thanks to Proxmox Backup Server, I’m able to do that and not waste tons of space. Even if the stars align and I have a failover event where ZFS replication has been broken for some time, I’ll have frequent checkpoints I can fall back on.
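To make the data loss window concrete, here is a small sketch that treats "seconds since the last good sync" as the amount of change at risk and warns when it exceeds a limit. The function names and the timestamps are hypothetical; how you obtain the time of the last successful replication on your own cluster is left out on purpose.

```shell
#!/bin/bash
# Seconds of changes that would be lost if the node failed right now.
# Both arguments are Unix timestamps.
data_at_risk_seconds() {
    local last_good_sync=$1 now=$2
    echo $(( now - last_good_sync ))
}

# check_replication_age LAST_GOOD_SYNC NOW MAX_AGE_SECONDS
# Returns 1 (and says so) when the last good sync is older than the limit.
check_replication_age() {
    local last_good_sync=$1 now=$2 max_age=$3
    local age
    age=$(data_at_risk_seconds "$last_good_sync" "$now")
    if [ "$age" -gt "$max_age" ]; then
        echo "STALE: last good sync was ${age}s ago (limit ${max_age}s)"
        return 1
    fi
    echo "OK: last good sync was ${age}s ago"
}

# Made-up example: hourly schedule, last good sync 30 minutes ago.
check_replication_age 1700000000 1700001800 3600
```

The point is that the window is bounded by the schedule only while syncs keep succeeding; once they stop, the age keeps growing until something notices.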
The Challenges
The UDM Pro isn’t made for this, clearly. It assumes all the config changes are
made inside the various apps it hosts, so OS level changes are often wiped with
firmware updates or even simple reboots. There is persistent storage in /data
that should survive, so we’ll need to keep copies of our config there.
Of note, I did find this other guide to do the same thing, but it uses Docker and containers. Given the UDM’s complicated network setup since it’s a router, this seemed more trouble than benefit. Likewise, my internal Proxmox Management network is an L3 network, so the UDM doesn’t actually sit directly on it and it felt like work to adapt the guide.
Since the UDM has the normal Debian repos still enabled and is running mostly Debian, it’s easy to just install the software directly and persist things manually.
The process
Do all of this at your own risk. You can brick the device by messing with the operating system.
Set up on the UDM
- Enable SSH access and note the password. This will be needed so Proxmox can join the voter to the cluster.
- Create a directory in /data where config and packages can be persisted. I made mine /data/qnetd.
- Install the needed corosync package: apt update && apt install corosync-qnetd
We also need to install an on-boot library that will run scripts when the UDM starts. Follow the install guide or just run:
curl -fsL "https://raw.githubusercontent.com/unifi-utilities/unifios-utilities/HEAD/on-boot-script/remote_install.sh" | /bin/bash
This will create a new /data/on_boot.d directory, and we can put scripts in
there to run during the UDM’s normal boot.
Configure the voter
At this point, we can create the needed configuration to have the voter node join. Keep in mind that you should not reboot the UDM (or do any firmware upgrades) from this point until persistence is configured.
- Install the qdevice package on all cluster Proxmox nodes: apt install corosync-qdevice
- On a single node, configure the device and join it via pvecm qdevice setup <UDM-IP>. You’ll be prompted for the SSH password you configured, and it should configure key based authentication.
- Verify the voter is visible with pvecm status in Proxmox.
Configure persistence
At this point if the UDM reboots, you may lose the configuration. We’ll need to
keep copies of everything in our folder in /data so it can be reinstalled in
the future.
First, download the packages we need in case we need to reinstall.
mkdir -p /data/qnetd/packages
cd /data/qnetd/packages
apt-get download corosync-qnetd libnss3 libnss3-tools
This downloads the apt packages needed. Should things get uninstalled, this lets us reinstall them without depending on routing or the UDM’s network stack being fully functional.
Next, we need to copy in the corosync config and SSH authorized keys. Both can be removed during reboots or firmware updates:
cp -a /etc/corosync/qnetd /data/qnetd/etc-qnetd
cp /root/.ssh/authorized_keys /data/qnetd/authorized_keys.backup
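Before rebooting anything, it may be worth confirming that all three pieces actually landed in the persistent directory. The check below is a sketch that assumes the /data/qnetd layout used in this post; check_persisted is a hypothetical helper name.

```shell
#!/bin/bash
# check_persisted DIR
# Reports anything missing from the persistence directory and returns 1
# if at least one item is absent.
check_persisted() {
    local dir=$1 missing=0
    # At least one downloaded .deb package
    ls "$dir"/packages/*.deb >/dev/null 2>&1 || { echo "missing: $dir/packages/*.deb"; missing=1; }
    # The copied qnetd configuration directory
    [ -d "$dir/etc-qnetd" ] || { echo "missing: $dir/etc-qnetd"; missing=1; }
    # A non-empty copy of root's authorized_keys
    [ -s "$dir/authorized_keys.backup" ] || { echo "missing: $dir/authorized_keys.backup"; missing=1; }
    return "$missing"
}

check_persisted /data/qnetd || echo "Fix the items above before rebooting the UDM."
```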
And finally, make a script that our on-boot package can call to set things up
at boot. Put this anywhere in /data/on_boot.d. I called mine 99-qnetd.sh.
#!/bin/bash
set -x
PERSIST_DIR="/data/qnetd"
Q_USER="coroqnetd"

# 1. Restore the packages if the binary is missing (post-firmware update)
if [ ! -f /usr/bin/corosync-qnetd ]; then
    echo "Restoring corosync-qnetd from local cache..."
    # Install from local .deb files to avoid needing internet/repos
    dpkg -i "$PERSIST_DIR"/packages/*.deb
fi

# 2. Restore the persistent configuration/certificates
# Proxmox needs these specific certs to recognize the QDevice.
# etc-qnetd is a copy of /etc/corosync/qnetd, so its contents go back
# into that directory rather than directly into /etc/corosync.
mkdir -p /etc/corosync/qnetd
cp -av "$PERSIST_DIR"/etc-qnetd/. /etc/corosync/qnetd/
mkdir -p /var/run/corosync-qnetd
chown "$Q_USER:$Q_USER" /var/run/corosync-qnetd
# The daemon reads the restored copy, so that is the one it must own
chown -R "$Q_USER:$Q_USER" /etc/corosync/qnetd

# 3. Restore Proxmox SSH access (Critical for cluster communication)
mkdir -p /root/.ssh
if [ -f "$PERSIST_DIR/authorized_keys.backup" ]; then
    # Append the saved keys, then de-duplicate so repeated boots do not grow the file
    cat "$PERSIST_DIR/authorized_keys.backup" >> /root/.ssh/authorized_keys
    sort -u /root/.ssh/authorized_keys -o /root/.ssh/authorized_keys
    chmod 600 /root/.ssh/authorized_keys
fi

# 4. Start the daemon
echo "Starting corosync-qnetd..."
systemctl restart corosync-qnetd --no-block
This script basically repeats everything each time the UDM boots and should restore things that get removed by the Unifi packages. I’ve tested this across reboots without issue, but I haven’t had a firmware upgrade to test with.
Firmware updates
Thanks to a recent firmware update, I did test if this was able to survive a firmware upgrade and the answer is probably yes. I found a few bugs in my scripts that were fixed above, and some errors in translation from generalizing things for this post.
It’s worth checking pvecm status on your cluster after you do a major upgrade
just to make sure the voter is still present. You should see this output at the
end. A 0 under votes means the qdevice isn’t online.
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 10.0.7.16
0x00000002          1    A,V,NMW 10.0.7.14 (local)
0x00000000          1            Qdevice
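If you would rather not eyeball that table after every upgrade, the check can be scripted. The sketch below pulls the Votes column from the Qdevice row; it assumes the output keeps the column layout shown above, and qdevice_votes is a hypothetical helper, not a Proxmox command. The sample text stands in for real output so the snippet runs anywhere.

```shell
#!/bin/bash
# qdevice_votes: reads `pvecm status` style output on stdin and prints the
# Votes column of the Qdevice row. Prints nothing if there is no such row.
qdevice_votes() {
    awk '$1 ~ /^0x/ && $NF == "Qdevice" { print $2 }'
}

# Sample membership section, copied from the output above.
sample='Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 10.0.7.16
0x00000002          1    A,V,NMW 10.0.7.14 (local)
0x00000000          1            Qdevice'

votes=$(printf '%s\n' "$sample" | qdevice_votes)
if [ "$votes" = "1" ]; then
    echo "qdevice is voting"
else
    echo "qdevice is NOT voting (votes: ${votes:-none})"
fi

# On a real cluster node the input would come from the command itself:
#   pvecm status | qdevice_votes
```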