
Announce Proxmox (PVE) VM IPs via BGP #1

This post is a bit chaotic and some information may be missing.

On November 10th 2022, I had to move all my virtual machines to a new cluster which does not have an L2 router in front or any other IP connectivity where I could do an easy failover.

In the old cluster, I had a simple VyOS box in front which handled all the IP traffic in two VRFs: one for the prefixes provided by Hetzner / AS24940 and a second one for my own AS. Since I don’t have any IP resources at Hetzner anymore, I only need a solution for the IPs within my own AS.

My first idea was to create a router VM on each PVE node which acts as a gateway and announces the network via BGP to my route reflectors. In the end, this turned out to be a bad idea. Why? I mean, it could theoretically work, but:

  1. I’d have to write something to announce the IPs as a more-specific route (VMs have to be migratable), or announce less-specific networks for each node and assign the VMs to the node directly.
  2. I’d have to maintain more servers / VMs (I already have too many)
  3. The effort to keep the announcements and the actual VM state in sync isn’t worth it.

For me, the first point already killed my idea. So, how do I solve the issue?

Short story: ✨Hooks✨

Long story:

I searched for something like the libvirt qemu hook in PVE. It turns out that PVE has had hook support for a while. It isn’t documented in the Wiki, only in the PVE Admin Guide. Unfortunately, this hook implementation does not have a “global” flag. That means you have to define the hook for each VM separately (which is easy to automate).

If a VM should run a hook script, you can set it with qm on the host:

$ qm set 9000 --hookscript local:snippets/my-hook.sh

Okay, but why do you use local:snippets and not /usr/libexec or /usr/local/bin?

That’s another issue in the PVE hookscript implementation. You can only set a hookscript that is stored on a PVE storage (e.g. NFS, CephFS or local:snippets).

How do you know which IPs get announced?

At some point, I need to keep the state of the IPs somewhere. The easiest way to solve this is to save the IP config in a JSON file stored on the host. I only have ~30 VMs and two hosts in this cluster, so I can easily store this file on both servers.

Okay, that’s cool. But how do you announce the IPs now?

Network configuration

Each host in the cluster has a bridge to which every VM with a public network is connected. This bridge acts as the gateway for the virtual machines.

The config itself is very basic:

auto vmbr0
iface vmbr0 inet static
	address 172.31.1.1/32
	bridge-ports none
	bridge-stp off
	bridge-fd 0
	hwaddress ether da:f0:f1:08:55:c4

iface vmbr0 inet6 static
	address fe80::1/64
	post-up ip -6 ru add from all to 2a0c:9a40:804X:XXXX:XXXX::X lookup 100
	post-up ip -6 ru add from 2a0c:9a40:804X:XXXX:XXXX::X lookup 100
	post-down ip -6 ru del from all to 2a0c:9a40:804X:XXXX:XXXX::X lookup 100
	post-down ip -6 ru del from 2a0c:9a40:804X:XXXX:XXXX::X lookup 100

Hold up, why do you set a custom MAC address to the bridge?

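After a live migration, the VM still has the MAC address of the old host’s gateway in its neighbour cache. If the gateway MAC address differs on the new host, the VM may become unreachable, either temporarily or permanently. Using the same MAC address for the bridge on every host avoids this.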

JSON Config

The config must contain at least the following information:

  • The MAC Address (I don’t query the PVE API at all)
  • The Interface
  • IPv4 and IPv6 addresses to be announced
  • The Routing Table ID

The Routing Table ID is mandatory for policy-based routing (PBR) in my case. The traffic for public VMs needs to get routed to another network, but the default route for that network is in another routing table.

The configuration structure I use looks like this:

type IP struct {
	Family  string `json:"family"`
	Address string `json:"address"`
}

type IPNetwork struct {
	Family  string `json:"family"`
	Network string `json:"network"`
	NextHop string `json:"next_hop"`
}

type Configuration struct {
	IPAddresses []IP        `json:"ip_addresses"`
	IPNetworks  []IPNetwork `json:"ip_networks"`
	MacAddress  string      `json:"mac_address"`
	Interface   string      `json:"interface"`
	Table       int         `json:"table"`
}

The config is bound to the VM/Domain ID. So the config file for VM 100 is named 100.json, for example.

Bird (the BGP daemon) config

Bird is a lightweight and simple BGP daemon. The neat part about bird is that it supports RFC 5549, i.e. announcing IPv4 routes with an IPv6 next hop (requires bird >2.0.9 to work with kernel export).

That means the hypervisor does not require any IPv4 address except the private gateway IP for the VMs 🥳.

To summarise what my bird config contains:

  • Router ID (every host needs one, and it must be unique)
  • Some logging config
  • Kernel Protocol for my second kernel routing table
  • Direct protocol for next-hop lookup in some cases
  • BGP peer config

Or “in config” (only necessary parts):

$ cat bird_global.conf

router id 100.90.100.1;

ipv4 table public4;
ipv6 table public6;

protocol device {
    scan time 10;
}

protocol kernel kernel_public4 {
    learn;
    persist;
    scan time 10;
    kernel table 100;
    ipv4 {
        table public4;
        import all;
        export all;
    };
}

protocol kernel kernel_public6 {
    learn;
    persist;
    scan time 10;
    kernel table 100;
    ipv6 {
        table public6;
        import all;
        export all;
    };
}

protocol direct {
    interface "*";
    ipv6 {
        import filter {
            if (net = 2a0c:9a40:804X:XXXX:XXXX::/127) then accept;
            reject;
        };
    };
}

Why is there a filter in the direct protocol?

I only need to import the point-to-point (P2P) routes towards my routers, so that bird can resolve the next hop correctly. Everything else on the host’s interfaces is rejected.

$ cat peers.conf

template bgp upstream {
    local as 4220000001;
    ipv4 {
        table public4;
        import filter {
            if net = 0.0.0.0/0 then accept;
            reject;
        };
        export filter {
            if (ifname = "vmbr0") then accept;
            reject;
        };
        next hop self;
        extended next hop;
    };
    ipv6 {
        table public6;
        import filter {
            if net = ::/0 then accept;
            reject;
        };
        export filter {
            if (ifname = "vmbr0") then accept;
            reject;
        };
        next hop self;
    };
}

protocol bgp rtr01 from upstream {
    neighbor 2a0c:9a40:804X:XXXX:XXXX:: as 213392;
}

To explain the BGP config a bit more:

local as 4220000001;

I use private ASNs for the hosts. It has something to do with the internal structure of my network.

ipv4 {
    table public4;

This forces bird to handle the routes in my custom routing table public4 (KRT table 100 in my case).

import filter {
    if net = 0.0.0.0/0 then accept;
    reject;
};

I only need a default route from my routers. It is a precautionary measure to prevent route leaks.

export filter {
    if (ifname = "vmbr0") then accept;
    reject;
};

Routes from vmbr0 should get announced to my routers.

next hop self;

This forces the next hop to be the host’s own IP address, in case it would otherwise be something else.

extended next hop;

This flag is only required for the IPv4 AFI as I use RFC 5549 to announce IPv4 with an IPv6 next hop.

protocol bgp rtr01 from upstream {} defines the session from the template upstream above.

Yeah, a lot of BGP config, but I haven’t explained how to configure the IPs yet.

Introduction to “fast-announcer”

fast-announcer is a hook I’ve written in Go to configure IP routes, rules (PBR) and static neighbour entries (IPv4 only). My code is extremely ugly and I recommend that nobody uses it.

fast-announcer uses the netlink library by vishvananda which works pretty well. However, I noticed that this library lacks some documentation. So, I had to reverse-engineer a bit from the _test.go files.

To keep it short, fast-announcer:

  • Parses the arguments PVE passes to fast-announcer
  • Checks the network JSON config
  • Adds or removes the IP routes, rules and ARP entries
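The last step can be sketched roughly like this. Note that this is not the real code: fast-announcer talks to the kernel via the netlink library, while this sketch only prints equivalent iproute2 commands so it runs without root. The exact shape of the routes and rules (a host route on the bridge plus a “from” and a “to” rule per address) is an assumption modelled on the vmbr0 rules shown earlier, and `plan` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

type IP struct {
	Family  string
	Address string
}

type Configuration struct {
	IPAddresses []IP
	MacAddress  string
	Interface   string
	Table       int
}

// plan returns the iproute2 commands that add (up = true) or remove
// (up = false) the host routes, PBR rules and static IPv4 neighbour entries.
func plan(cfg Configuration, up bool) ([]string, error) {
	action := "del"
	if up {
		action = "add"
	}
	table := strconv.Itoa(cfg.Table)
	var cmds []string
	for _, ip := range cfg.IPAddresses {
		addr := net.ParseIP(ip.Address)
		if addr == nil {
			return nil, fmt.Errorf("invalid IP address %q", ip.Address)
		}
		isV4 := addr.To4() != nil
		fam, host := "-6", addr.String()+"/128"
		if isV4 {
			fam, host = "-4", addr.String()+"/32"
		}
		cmds = append(cmds,
			fmt.Sprintf("ip %s route %s %s dev %s table %s", fam, action, host, cfg.Interface, table),
			fmt.Sprintf("ip %s rule %s from %s lookup %s", fam, action, host, table),
			fmt.Sprintf("ip %s rule %s to %s lookup %s", fam, action, host, table),
		)
		if !isV4 {
			continue // static neighbour entries are IPv4 only
		}
		if up {
			cmds = append(cmds, fmt.Sprintf("ip neigh replace %s lladdr %s dev %s nud permanent",
				addr.String(), cfg.MacAddress, cfg.Interface))
		} else {
			cmds = append(cmds, fmt.Sprintf("ip neigh del %s dev %s", addr.String(), cfg.Interface))
		}
	}
	return cmds, nil
}

func main() {
	// Made-up example values using documentation addresses.
	cfg := Configuration{
		IPAddresses: []IP{{"inet", "192.0.2.10"}, {"inet6", "2001:db8::10"}},
		MacAddress:  "52:54:00:12:34:56",
		Interface:   "vmbr0",
		Table:       100,
	}
	cmds, err := plan(cfg, true)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

Because the routes end up on vmbr0 in table 100, the kernel protocol with `learn` picks them up and the BGP export filter on `ifname = "vmbr0"` announces them.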

PVE will run the hook four times with the following arguments:

  1. fast-announcer vm-id pre-start
  2. fast-announcer vm-id post-start
  3. fast-announcer vm-id pre-stop
  4. fast-announcer vm-id post-stop

I only care about post-start and post-stop, where I add and remove the IP configuration respectively.
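A minimal sketch of how such a hook can dispatch on these two arguments could look like this. The names `parseHookArgs`, `announce` and `withdraw` are made up for this example, and the actual route handling is left out.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

type action int

const (
	ignore   action = iota // pre-start and pre-stop: nothing to do
	announce               // post-start: add routes, rules and neighbours
	withdraw               // post-stop: remove them again
)

// parseHookArgs interprets the two arguments PVE passes to a hookscript:
// the VM ID and the phase.
func parseHookArgs(args []string) (int, action, error) {
	if len(args) != 2 {
		return 0, ignore, fmt.Errorf("expected <vm-id> <phase>, got %d argument(s)", len(args))
	}
	vmID, err := strconv.Atoi(args[0])
	if err != nil {
		return 0, ignore, fmt.Errorf("invalid VM ID %q: %w", args[0], err)
	}
	switch args[1] {
	case "post-start":
		return vmID, announce, nil
	case "post-stop":
		return vmID, withdraw, nil
	case "pre-start", "pre-stop":
		return vmID, ignore, nil
	}
	return 0, ignore, fmt.Errorf("unknown phase %q", args[1])
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Demo fallback so the sketch prints something when run without arguments.
		args = []string{"100", "post-start"}
	}
	vmID, act, err := parseHookArgs(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	switch act {
	case announce:
		fmt.Printf("VM %d: would add the IP configuration\n", vmID)
	case withdraw:
		fmt.Printf("VM %d: would remove the IP configuration\n", vmID)
	default:
		fmt.Printf("VM %d: nothing to do in this phase\n", vmID)
	}
}
```

Returning early for the pre-* phases keeps the hook cheap, since PVE calls it in every phase anyway.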