Xen, network-bridge, peth, and VLANs

Hot-adding new VLANs to my older Xen boxes reportedly required a reboot, with no good explanation of why, so I set out to unravel it.

Older versions of Xen (pre-4.1) ship with a script that creates bridges: /etc/xen/scripts/network-bridge.  In theory, adding a new VLAN to the dom0 was as easy as setting up the new VLAN interface, then building the bridge on top of it for VMs to attach to.  The following should have added VLAN 52 to eth1, with a resulting bridge xenbr52:

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1.52
DEVICE=eth1.52
ONBOOT=yes
VLAN=yes
$ ifup eth1.52
$ /etc/xen/scripts/network-bridge start netdev=eth1.52 vifnum=28 bridge=xenbr52

This appeared to work - no errors or other signs of distress - but packet captures against the new VLAN interface eth1.52 showed no traffic, even when I purposely triggered ARPs from working systems on the same VLAN.  Those same ARP frames could be seen on the physical interface though; something was clearly amiss and blaming the physical network was off the table.

The observation that pointed toward a fix came from looking at the driver for eth1:

$ ethtool -i eth1
driver: netloop
version: 
firmware-version: 
bus-info: vif-0-1

eth1 was a virtual interface (netloop) and not associated with a PCI bus, which is to say eth1 wasn't a physical interface.  Listing the configured links showed very few "eth1"s but quite a few "peth1"s, along with a clear indicator that my new, non-working VLAN 52 was different from its predecessors:

$ ip link show | egrep '@eth1|@peth1'
7: peth1.49@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
8: peth1.50@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
9: peth1.51@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue 
98: peth1.52@eth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue

The other VLANs were built on "peth1", which, going by its driver, was the physical interface.  Building a VLAN interface on top of the vif that now carries the name eth1 doesn't work.
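
Both checks are easy to script.  What follows is a minimal sketch, not part of Xen: the helper names driver_of and vlans_on_wrong_parent are made up for illustration, and the second one leans on the assumption that a physical parent always carries the "p" prefix that network-bridge gives it.

```shell
# Hypothetical helper: print the driver name from `ethtool -i` output on stdin.
driver_of() {
    awk -F': ' '$1 == "driver" { print $2 }'
}

# Hypothetical helper: read `ip link show` output on stdin and list VLAN
# interfaces whose parent (the part after "@") does not start with "p".
# Assumption: physical parents were renamed to p<name> by network-bridge.
vlans_on_wrong_parent() {
    awk -F': ' '/^[0-9]+: [^:@]+@/ {
        split($2, parts, "@")
        if (parts[2] !~ /^p/) print parts[1] " (parent: " parts[2] ")"
    }'
}
```

Typical use would be `ethtool -i eth1 | driver_of` and `ip link show | vlans_on_wrong_parent`.  Fed the listing above, the second helper prints only the VLAN 52 line, as `peth1.52 (parent: eth1)`.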

This business with having peths is a byproduct of the network-bridge script.  The script is meant to take an interface and set up a bridge without affecting the original non-bridged network setup.  To do this it takes an original interface and a vif, and then: creates a bridge, renames the original interface by prepending a "p" ("peth1"), renames the vif to the original interface's name ("eth1"), attaches the two interfaces to the bridge, moves the IP from the original to the vif, and brings all three devices up.
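
The renaming is the part that matters here, so it is worth spelling out.  This is only a sketch of the naming outcome as described above, not the script's actual code; the function name bridge_rename_plan is hypothetical, and the vif's pre-rename name veth<vifnum> is my assumption about how the script picks it.

```shell
# Hypothetical helper: show which names change hands when network-bridge
# is started with netdev=$1 and vifnum=$2. It only prints; it changes nothing.
bridge_rename_plan() {
    netdev=$1   # interface passed as netdev=
    vifnum=$2   # number passed as vifnum=
    # The original (physical) interface gets a "p" prepended.
    printf '%s -> p%s\n' "$netdev" "$netdev"
    # The vif takes over the original name (veth<vifnum> is an assumption).
    printf 'veth%s -> %s\n' "$vifnum" "$netdev"
}
```

Running `bridge_rename_plan eth1 1` prints `eth1 -> peth1` and then `veth1 -> eth1`, which is exactly the state that trips up a later `ifup eth1.52`.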

With this info in hand it's straightforward to fix.  Get rid of the non-functional VLAN 52 interface and bridge, and then create the VLAN interface off of peth1 instead of eth1.  This can be done with modern versions of 'ip', but if you don't have those handy, 'vconfig' will do:

$ vconfig add peth1 52
Added VLAN with VID == 52 to IF -:peth1:-
$ /etc/xen/scripts/network-bridge start netdev=peth1.52 vifnum=29 bridge=xenbr52
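
On a box with a newer iproute2, the vconfig step could be replaced with 'ip link add ... type vlan'.  The sketch below just prints the commands so they can be reviewed before running them as root; the wrapper name vlan_add_cmd is hypothetical, and I have not run this form on these older dom0s.

```shell
# Hypothetical helper: print the iproute2 equivalent of `vconfig add $1 $2`.
# It prints the commands only; pipe the output to sh (as root) to apply them.
vlan_add_cmd() {
    parent=$1   # parent device, e.g. peth1
    vid=$2      # VLAN ID, e.g. 52
    printf 'ip link add link %s name %s.%s type vlan id %s\n' \
        "$parent" "$parent" "$vid" "$vid"
    printf 'ip link set %s.%s up\n' "$parent" "$vid"
}
```

`vlan_add_cmd peth1 52` shows the two commands; `vlan_add_cmd peth1 52 | sh` applies them, after which the network-bridge call is the same as above.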

This yields a functional but still odd-looking configuration: the prepending keeps going, so peth1.52 gets renamed to ppeth1.52:

$ ip link show | egrep '@eth1|@peth1'
7: peth1.49@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
8: peth1.50@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
9: peth1.51@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
99: ppeth1.52@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue

The final question of why it would work on reboot, but not for a hot addition, is also straightforward: when the system initially boots, the VLAN interfaces are defined prior to the network-bridge creation process.  The physical interfaces aren't renamed until after the VLANs are already up, so this problem only occurs when you're adding VLANs and the bridge creation has already occurred.
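
To make hot additions safe either way, a script can pick the VLAN parent based on whether the rename has already happened.  A minimal sketch, assuming the renamed device shows up under /sys/class/net and that a "p"-prefixed twin only exists because of network-bridge; the function name vlan_parent_for and its optional second argument (an alternate directory, handy for testing) are made up for illustration.

```shell
# Hypothetical helper: print the device a new VLAN should be built on.
# If p<dev> exists, network-bridge has already renamed the physical NIC,
# so use that; otherwise <dev> itself is still the physical interface.
vlan_parent_for() {
    dev=$1
    sysroot=${2:-/sys/class/net}   # overridable for testing
    if [ -e "${sysroot}/p${dev}" ]; then
        printf 'p%s\n' "$dev"
    else
        printf '%s\n' "$dev"
    fi
}
```

With this, `vconfig add "$(vlan_parent_for eth1)" 52` targets peth1 on a running dom0 and eth1 on one where the bridge has not been built yet.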