Announce Proxmox (PVE) VM IPs via BGP #1
On November 10th 2022, I had to move all my virtual machines to a new cluster which does not have an L2 router in front or any other IP connectivity where I could do an easy failover.
In the old cluster, I had a simple VyOS box in front and handled all the IP traffic in two VRFs. One for the prefixes provided by Hetzner / AS24940 and the second for my own AS. Since I don’t have any IP resources at Hetzner anymore, I need a solution for the IPs within my own AS.
My first idea was to create a router VM on each PVE node which acts as a gateway and announces the network via BGP to my route reflectors. In the end, this turned out to be a bad idea. Why? It could theoretically work, but:
- I’d have to write something to announce the IPs as a more-specific route (VMs have to be migratable), or announce less-specific networks for each node and assign the VMs to the node directly.
- I’d have to maintain more servers / VMs (I already have too many)
- The effort to keep the announcements and the actual VM state in sync isn’t worth it.
For me, the first point already killed my idea. So, how do I solve the issue?
Short story: ✨Hooks✨
Long story:
I searched for something like the libvirt qemu hook in PVE. It turns out that PVE has had hook support for a while; it isn’t documented in the Wiki, only in the PVE Admin Guide. Unfortunately, this hook implementation does not have a “global” flag. That means you have to define the hook for each VM separately (which is easily automatable).
If a VM should run a hook script, you can set it with qm on the host:
Okay, but why do you use local:snippets and not /usr/libexec or /usr/local/bin?
That’s another issue in the PVE hookscript implementation. You can only set a hookscript stored in a PVE directory (e.g. NFS, CephFS or local:snippets).
How do you know which IPs get announced?
At some point, I need to keep the state of the IPs somewhere. The easiest way to solve this is to save the IP config in a JSON file stored on the host. I only have ~30 VMs and two hosts in this cluster, so I can easily store this file on both servers.
Okay, that’s cool. But how do you announce the IPs now?
Network configuration
Each host in the cluster has a bridge to which every VM with a public network is connected. This bridge acts as the gateway for the virtual machines.
The config itself is very basic:
Hold up, why do you set a custom MAC address on the bridge?
After a live migration, the VM may become unreachable, either temporarily or permanently, because the MAC address of its gateway has changed. Using the same custom MAC address on the bridge of every host avoids that.
JSON Config
The config must contain at least the following information:
- The MAC Address (I don’t query the PVE API at all)
- The Interface
- IPv4 and IPv6 addresses to be announced
- The Routing Table ID
The Routing Table ID is mandatory for PBR (policy-based routing) in my case. The traffic for public VMs needs to get routed to another network, but the default route for it is in another routing table.
The configuration structure I use looks like this:
The config is bound to the VM/Domain ID. So, the config name is, for example, 100.json.
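As a rough illustration of the four fields listed above, here is a minimal Go sketch that loads and validates such a per-VM file. The JSON key names (`mac`, `interface`, `ipv4`, `ipv6`, `table`), the directory `/etc/fast-announcer` and the sample addresses are assumptions for this example, not the layout I actually use.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net"
	"net/netip"
	"path/filepath"
)

// VMConfig holds the four pieces of information listed above.
// The JSON key names are hypothetical; the real file layout may differ.
type VMConfig struct {
	MAC       string       `json:"mac"`       // MAC address of the VM
	Interface string       `json:"interface"` // interface the VM is reachable on (assumed: the bridge)
	IPv4      []netip.Addr `json:"ipv4"`      // IPv4 addresses to announce
	IPv6      []netip.Addr `json:"ipv6"`      // IPv6 addresses to announce
	Table     int          `json:"table"`     // kernel routing table ID used for PBR
}

// configPath returns the per-VM file name, e.g. <dir>/100.json for VM 100.
func configPath(dir string, vmID int) string {
	return filepath.Join(dir, fmt.Sprintf("%d.json", vmID))
}

// parseConfig decodes one VM config and rejects incomplete entries.
func parseConfig(data []byte) (VMConfig, error) {
	var cfg VMConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("decode config: %w", err)
	}
	if _, err := net.ParseMAC(cfg.MAC); err != nil {
		return cfg, fmt.Errorf("invalid MAC %q: %w", cfg.MAC, err)
	}
	if cfg.Interface == "" {
		return cfg, errors.New("interface is missing")
	}
	if cfg.Table <= 0 {
		return cfg, errors.New("routing table ID is missing")
	}
	if len(cfg.IPv4)+len(cfg.IPv6) == 0 {
		return cfg, errors.New("no addresses to announce")
	}
	for _, a := range cfg.IPv4 {
		if !a.Is4() {
			return cfg, fmt.Errorf("%s is not an IPv4 address", a)
		}
	}
	for _, a := range cfg.IPv6 {
		if !a.Is6() {
			return cfg, fmt.Errorf("%s is not an IPv6 address", a)
		}
	}
	return cfg, nil
}

func main() {
	// Sample data using documentation prefixes; not a real VM.
	sample := []byte(`{"mac": "52:54:00:12:34:56", "interface": "vmbr0",
		"ipv4": ["192.0.2.10"], "ipv6": ["2001:db8::10"], "table": 100}`)
	cfg, err := parseConfig(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(configPath("/etc/fast-announcer", 100), cfg.Interface, cfg.IPv4, cfg.IPv6, cfg.Table)
}
```

Decoding straight into `netip.Addr` means a typo in an address is rejected when the file is read, instead of surfacing later when the route is installed.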
Bird (the BGP daemon) config
Bird is a lightweight and simple BGP daemon. The neat part about bird is that it supports RFC 5549 (requires bird >2.0.9 to work with kernel export).
That means the hypervisor does not require any IPv4 address except the private gateway IP for the VMs 🥳.
To summarise what my bird config contains:
- Router ID (must be unique; every host needs one)
- Some logging config
- Kernel Protocol for my second kernel routing table
- Direct protocol for next-hop lookup in some cases
- BGP peer config
Or “in config” (only necessary parts):
Why is there a filter in the direct protocol?
I only need to import the P2P routes to my routers to set the next hop correctly.
To explain the BGP config a bit more:
local as 4220000001
I use private ASNs for the host. It has something to do with the internal structure of my network.
This forces bird to handle the routes in my custom routing table public4 (KRT table 100 in my case).
I only need a default route from my routers. It is a precautionary measure to prevent route leaks.
Routes from vmbr0 should get announced to my routers.
Force set the next hop to the host IP address in case it’s something else.
This flag is only required for the IPv4 AFI as I use RFC 5549 to announce IPv4 with an IPv6 next hop.
protocol bgp rtr01 from upstream {} defines the session from the template upstream above.
Yeah, a lot of BGP config, but I haven’t explained how to configure the IPs yet.
Introduction to “fast-announcer”
fast-announcer is a hook I’ve written in Go to configure IP routes, Rules (PBR) and static neighbour entries (IPv4 only). My code is extremely ugly and I recommend everyone to not use it.
fast-announcer uses the netlink library by vishvananda which works pretty well. However, I noticed that this library lacks some documentation. So, I had to reverse-engineer a bit from the _test.go files.
To keep it short, fast-announcer:
- Parses the arguments PVE uses to run fast-announcer
- Checks the network JSON config
- Adds or removes the IP routes, rules and ARP entries.
PVE will run the hook four times with the following arguments:
fast-announcer vm-id pre-start
fast-announcer vm-id post-start
fast-announcer vm-id pre-stop
fast-announcer vm-id post-stop
I only care about post-start and post-stop, where I add or remove the IP configuration.
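The argument handling can be sketched roughly like this in Go. This is not the real fast-announcer code; `parseHookArgs` and the `action` type are hypothetical names, and the sketch only maps the four phases to "add", "remove" or "do nothing".

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// action is what the hook should do for a given phase.
type action int

const (
	actionNone   action = iota // pre-start, pre-stop: nothing to do
	actionAdd                  // post-start: add the IP configuration
	actionRemove               // post-stop: remove the IP configuration
)

// parseHookArgs interprets the two arguments PVE passes to a hook script:
// the VM ID and the phase.
func parseHookArgs(args []string) (int, action, error) {
	if len(args) != 2 {
		return 0, actionNone, errors.New("expected <vm-id> <phase>")
	}
	vmID, err := strconv.Atoi(args[0])
	if err != nil || vmID <= 0 {
		return 0, actionNone, fmt.Errorf("invalid VM ID %q", args[0])
	}
	switch args[1] {
	case "post-start":
		return vmID, actionAdd, nil
	case "post-stop":
		return vmID, actionRemove, nil
	case "pre-start", "pre-stop":
		return vmID, actionNone, nil
	default:
		return 0, actionNone, fmt.Errorf("unknown phase %q", args[1])
	}
}

func main() {
	// A real hook would pass os.Args[1:]; here all four phases are
	// walked through for a sample VM ID instead.
	for _, phase := range []string{"pre-start", "post-start", "pre-stop", "post-stop"} {
		vmID, act, err := parseHookArgs([]string{"100", phase})
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		switch act {
		case actionAdd:
			fmt.Printf("VM %d %s: add routes, rules and neighbour entries\n", vmID, phase)
		case actionRemove:
			fmt.Printf("VM %d %s: remove routes, rules and neighbour entries\n", vmID, phase)
		default:
			fmt.Printf("VM %d %s: nothing to do\n", vmID, phase)
		}
	}
}
```

Treating the pre-start and pre-stop phases as an explicit no-op, and unknown phases as an error, means the hook fails loudly if it is ever called with something unexpected.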