Xen, network-bridge, peth, and VLANs

Hot-adding new VLANs to my older Xen boxes was reported to require a reboot, without a good explanation of why, so I set out to unravel it.

Older versions of Xen (pre-4.1) ship with a script that creates bridges - /etc/xen/scripts/network-bridge.  Adding a new VLAN to the dom0 should be as easy as setting up the new VLAN interface, then building a bridge on top of it for VMs to attach to.  In theory, the following adds VLAN 52 to eth1, with a resulting bridge xenbr52:

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1.52
DEVICE=eth1.52
ONBOOT=yes
VLAN=yes
$ ifup eth1.52
$ /etc/xen/scripts/network-bridge start netdev=eth1.52 vifnum=28 bridge=xenbr52

This appeared to work - no errors or other signs of distress - but packet captures against the new VLAN interface eth1.52 showed no traffic, even when I purposely triggered ARPs from working systems on the same VLAN.  Those same ARP frames could be seen on the physical interface though; something was clearly amiss and blaming the physical network was off the table.

The key observation that started me down the path to a resolution was the driver reported for eth1:

$ ethtool -i eth1
driver: netloop
version: 
firmware-version: 
bus-info: vif-0-1

eth1 was a virtual interface (netloop) with no PCI bus association - which is to say, eth1 wasn't a physical interface at all.  Looking at the configured links showed very few "eth1"s but quite a few "peth1"s, along with a clear indicator that my new, non-working VLAN 52 was different from its predecessors:

$ ip link show | egrep '@eth1|@peth1'
7: peth1.49@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
8: peth1.50@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
9: peth1.51@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue 
98: peth1.52@eth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue

The other VLANs were built on "peth1", which a look at its driver confirmed was the physical interface.  Building a VLAN interface on top of a vif named eth1 simply doesn't work.
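
For reference, checking the driver behind peth1 makes the distinction clear.  The exact driver and bus details will vary with the NIC - tg3 and the PCI address below are purely illustrative - but the point is that you see a real driver and a PCI bus-info line rather than netloop and a vif:

$ ethtool -i peth1
driver: tg3
bus-info: 0000:02:00.0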

This business of having peths is a byproduct of the network-bridge script.  The script is meant to take an interface and set up a bridge without disturbing the original non-bridged network setup.  To do this it takes the original interface and a vif, then: creates a bridge, renames the original interface by prepending a "p" ("peth1"), renames the vif to take the original interface's name ("eth1"), attaches both interfaces to the bridge, moves the IP address from the original interface to the vif, and brings all three devices up.
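
In other words, for eth1 the script does roughly the following under the hood.  This is only a sketch - the vif name (vif0.1), the bridge name, and the IP handling are illustrative, and the real script covers more corner cases:

brctl addbr xenbr1
ip link set eth1 down
ip link set eth1 name peth1          # physical NIC becomes peth1
ip link set vif0.1 name eth1         # the netloop vif takes over the name eth1
brctl addif xenbr1 peth1
brctl addif xenbr1 eth1
# ...copy the IP configuration from peth1 onto the new eth1...
ip link set peth1 up
ip link set eth1 up
ip link set xenbr1 up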

With this info in hand the fix is straightforward.  Get rid of the non-functional VLAN 52 interface and bridge, then create the VLAN interface off of peth1 instead of eth1.  This can be done with modern versions of 'ip', but if you don't have those handy, 'vconfig' will do:

$ vconfig add peth1 52
Added VLAN with VID == 52 to IF -:peth1:-
$ /etc/xen/scripts/network-bridge start netdev=peth1.52 vifnum=29 bridge=xenbr52

This yields a functional, if still odd-looking, configuration: the script's renaming step keeps right on going and renames peth1.52 to ppeth1.52:

$ ip link show | egrep '@eth1|@peth1'
7: peth1.49@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
8: peth1.50@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
9: peth1.51@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
99: ppeth1.52@peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
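
For what it's worth, on hosts with a reasonably modern iproute2 the same VLAN interface can be created without vconfig; the device names here match the example above:

$ ip link add link peth1 name peth1.52 type vlan id 52
$ ip link set peth1.52 up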

The final question - why it works on reboot but not for a hot addition - is also straightforward: when the system boots, the VLAN interfaces are defined before the network-bridge script runs.  The physical interfaces aren't renamed until after the VLANs are already up, so the problem only appears when you add a VLAN after bridge creation has already occurred.

ls - grep - wc, newlines, and conditional formats

I finally sorted out a minor nuisance that I've noticed periodically over the years.  The output of 'ls' magically worked the way I expected when grep'ing or wc'ing it, yet judging by what I saw on the terminal, it shouldn't have:

    user $ ls
    f1  f2  f3              #  Formatted on a single line
    user $ ls | grep f1
    f1                      #  As if "f1" was on a line by  itself
    user $ ls | wc -l
       3                #  wc observes 3 lines, when it should be one

I apparently have (had, hopefully) a blind spot for the source of this problem, because I assumed it was some nuance of my terminal, or maybe parsing.  I assumed that what comes out of a standard unix command doesn't depend on where the output is being sent, which in this case is completely wrong.  The answer is clear from the source for GNU coreutils ls (http://git.savannah.gnu.org/):

      if (isatty (STDOUT_FILENO))
        {
          format = many_per_line;
          /* See description of qmark_funny_chars, above.  */
          qmark_funny_chars = true;
        }
      else
        {
          format = one_per_line;
          qmark_funny_chars = false;
        }

'ls' uses isatty() to check whether standard out is going to a tty: if it is, entries are formatted many per line; otherwise they're printed one per line.  A simple C program illustrates the same technique:

    #include <unistd.h>
    #include <stdio.h>

    int main(){
        if( isatty(1) == 1 ){
            printf("Output for tty\n");
        }
        else{
            printf("Output\nnot\nfor\ntty\n");
        }
        return 0;
    }

The results of running said program depend on whether standard out goes to a tty, or not:

    user $ ./a.out 
    Output for tty
    user $ ./a.out | cat
    Output
    not
    for
    tty
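
If you ever want to override the conditional behavior, ls can be told which format to use regardless of the destination: -1 forces one entry per line even on a terminal, and -C forces the columnar layout even into a pipe:

    user $ ls -1
    f1
    f2
    f3
    user $ ls -C | cat
    f1  f2  f3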

Openstack OverLimit errors during nova boot

I was stung by a provisioning error for a specific user in my Openstack environment recently.  The quotas associated with the user's tenant were the obvious area to review, but I couldn't find any quota that had been exceeded.

nova  boot ... --user-data file.txt newvm 

ERROR (OverLimit): Over limit (HTTP 413) (Request-ID: req-570be40d-6790-487e-8cfd-66a4d963fd68)

After much unfortunate thrashing, we discovered this user had a large user-data file.  It was sufficiently large (98k) that, once base64 encoded and combined with the rest of the API call, it exceeded the nova API max_request_body_size_opt limit (112k, based on nova/api/sizelimit.py).  The key to identifying the problem was enabling debug on the nova boot command:

nova --debug boot ...
...
RESP BODY: 413 Request Entity Too Large

Request is too large.

After thinning the user-data down to the point where the API request makes it through that first limit, we get a new (but much more helpful!) error:

ERROR (BadRequest): User data too large. User data must be no larger than 65535 bytes once base64 encoded. Your data is 111572 bytes (HTTP 400) (Request-ID: req-63197123-e6f8-496d-b338-c43b65b15e76)

In conclusion, be advised that the size of the API call - and by extension the size of the user-data - counts against the limits imposed by the nova API.
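
Given the error above, a quick sanity check before booting is to measure the user-data's post-encoding size and compare it against the 65535 byte limit (this assumes GNU base64, where -w0 disables line wrapping):

$ base64 -w0 file.txt | wc -c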

Updating user-data for existing Openstack VMs

I recently had the need to update the user-data assigned to many virtual machines in Openstack.  There's no direct way to do this through the API, but it can be done through the backend database.  The user-data content is stored base64 encoded in the database, and there's probably a way to take the script you want, convert it, and insert it directly, but I went with a different approach.
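
That direct route would presumably amount to base64 encoding the new file and writing the result straight into the user_data column - something like the following sketch, with '<base64 blob>' standing in for the encoded output:

$ base64 -w0 new-user-data.txt
mysql> update instances set user_data = '<base64 blob>' where uuid = 'bad-uuid';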

  1. Deploy a VM using the user-data you wish your pre-existing VM had
  2. Get the UUID for your new "good" VM - we'll call it good-uuid
    1. nova show goodvm | awk '$2=="id" {print}'
  3. Get the UUID for your old "bad" VM - we'll call it bad-uuid
  4. Then, in the database, set the user-data of the "bad" VM to the corresponding value from the "good" VM.  A temporary table is used because MySQL won't let an update select from the same table it's modifying:
mysql> use nova;
mysql> create temporary table instances1 like instances;
mysql> insert into instances1 select * from instances where uuid = 'good-uuid';
mysql> update instances set user_data = ( select user_data from instances1 ) where uuid = 'bad-uuid' limit 1;

And that's it.  You should be able to hop on your "bad" VM and check the new user-data with:

curl http://169.254.169.254/latest/user-data

Naturally you can change multiple VMs by using different where clauses as needed.  I confirmed this process works on grizzly and havana.
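
For example, to update every non-deleted instance in a particular tenant, something along these lines should do it - the project_id and deleted columns are from the grizzly/havana schema, so double-check them against your database first:

mysql> update instances set user_data = ( select user_data from instances1 ) where project_id = '<tenant-id>' and deleted = 0;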

Tracing DHCP through linux bridges in kvm

A linux virtual machine in my Openstack environment wasn't coming onto the network.  For some reason it couldn't successfully get an address from DHCP.

This time it wasn't a DHCP service problem though; all the usual bits were in order.  So I set to digging through the network stack to track down the problem.

The DHCP server runs on one physical system; the VM in question was running on a separate physical server.  Both are kvm hypervisors managed by libvirt, using the linux bonding and bridging modules.  A quick tcpdump against bond0 on both physical servers showed that the DHCP replies were making it to the physical bond interface on the hypervisor hosting my VM, so the physical network didn't appear to be the immediate cause.

I'd never had the chance to pick apart the bridging and virtual interface pieces of a kvm/libvirt virtual machine, so this afforded the chance.

First, make sure the VM is "wired" together correctly.  The virtual topology should be:  virtual machine interface (eth0) --> tap interface --> virtual bridge --> bond. 

Ask libvirt for the tap interface, bridge, and MAC of the instance:

$ sudo virsh dumpxml instance-00000347 | grep -A 5 interface
    <interface type='bridge'>
      <mac address='fa:16:3e:56:ea:fa'/>
      <source bridge='brq5dddda71-76'/>
      <target dev='tap0b5f208b-76'/>
      <model type='virtio'/>
      <alias name='net0'/>

With the tap interface in hand - tap0b5f208b-76 - we can check how it's connected into the virtual bridge.   brctl can show which tap interfaces are connected to which bridges.

$ brctl show brq5dddda71-76
bridge name	bridge id	STP enabled	interfaces
brq5dddda71-76 8000.90b11c4fcb86	no bond0
					tap0b5f208b-76
					tap0e6de0ae-f1
					tap81dd5a89-74
					tapb207946d-0f
					tapb394540f-76
					tapdb1ab4ac-ac
					tapf4cd7b46-a7

This is good; the tap interface is associated with the bridge, as expected.  We can also see that the bridge has an interface onto the bond; again, good.

Since the virtual cabling appears to be correctly in place, the next task is straightforward: the DHCP replies were arriving at the bond0 interface, so check each of the subsequent interfaces to see at which "hop" the replies were being dropped.
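
The same capture works at every hop; only the interface name changes (ports 67 and 68 are the DHCP server and client ports):

$ tcpdump -n -e -i brq5dddda71-76 port 67 or port 68
$ tcpdump -n -e -i tap0b5f208b-76 port 67 or port 68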

tcpdump against the bridge - brq5dddda71-76 - showed the DHCP replies were arriving.  However, the same tcpdump against the tap interface - tap0b5f208b-76 - did not show the DHCP replies.  For some reason, the frames weren't being transmitted from my bridge to the virtual machine to which they were addressed.

The MAC table on the bridge shows the first sign of a problem.  The linux bridge, like a normal switch, keeps a MAC table which maps an observed source MAC address to a port.   When a frame arrives at the bridge, it records the source MAC and creates an entry in the table so that it knows to forward future frames destined for that MAC down said port.

When libvirt connects a virtual machine to a bridge, it gives the bridge side of the "cable" the same MAC address as the VM, but with the first byte set to 0xFE.  Accordingly, a healthy MAC table entry for my problematic virtual machine should look like this:

 $ brctl showmacs brq5dddda71-76 | grep ea:fa
port no	mac addr	is local? ageing timer
  6	fe:16:3e:56:ea:fa	yes	  0.00
  6	fa:16:3e:56:ea:fa	no	  8.17

The key here is that the MAC for the VM (fa:16:3e:56:ea:fa) and the MAC that libvirt assigned to the bridge end for that VM (fe:16:3e:56:ea:fa) are both on the same port (6).  A healthy MAC table entry like this tells the bridge to send traffic destined for the VM down the "virtual cable" connected to the VM.  This is how it should be.

However, here is what I observed:

 $ brctl showmacs brq5dddda71-76 | grep ea:fa
port no	mac addr	is local? ageing timer
  6	fe:16:3e:56:ea:fa	yes	  0.00
  1	fa:16:3e:56:ea:fa	no	  1.35

For some reason, the MAC table learned that my virtual machine's MAC was on a port that the virtual machine wasn't connected to.  This explains why the DHCP replies which I observed making it to the bridge interface never made it to the tap interface; the MAC table on the bridge believes that  traffic for my MAC should go "somewhere else", in this case port 1.  Port 1 on this bridge is associated with the upstream network.  It connects the bridge into the bond.  It is not where traffic for my VM should be directed.

The most immediate explanation for this behavior is a MAC conflict; if another device somewhere on the same L2 segment has the same MAC as my VM, it's possible for the switching infrastructure, including my virtual bridge, to build MAC table entries which point towards it rather than my VM.
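
To confirm what port 1 actually is, brctl can map the bridge's port numbers back to interfaces; trimmed to the relevant lines, it shows port 1 is the bond and port 6 is my VM's tap:

$ brctl showstp brq5dddda71-76 | grep '('
bond0 (1)
tap0b5f208b-76 (6)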

UPDATE:

System messages indicated the same problem: MAC addresses being observed where they shouldn't be.

brq5dddda71-76: received packet on bond0.2222 with own address as source address

That is to say, the physical bond was receiving packets with a MAC address that it knew was local, but it was receiving them from an external source.  In the end, we found a switch upstream with a mis-cabled LAG that was the cause of the L2 loop.
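
If you suspect the same kind of loop, the message above lands in the kernel log on the hypervisor, so a quick way to check across hosts is:

$ dmesg | grep 'own address as source address'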