Tracing DHCP through Linux bridges in KVM
A Linux virtual machine in my OpenStack environment wasn't coming onto the network. For some reason it couldn't get an address from DHCP.
This time it wasn't a DHCP service problem though; all the usual bits were in order. So I set to digging through the network stack to track down the problem.
The DHCP server runs on one physical system; the VM in question was running on a separate physical server. Both are KVM hosts using libvirt with the Linux bonding and bridging modules. A quick tcpdump against bond0 on both physical servers showed that the DHCP replies were making it to the physical bond interface on the hypervisor hosting my VM, so the physical network didn't appear to be the immediate cause.
I'd never had the chance to pick apart the bridging and virtual interface pieces of a KVM/libvirt virtual machine, so this was a good opportunity.
First, make sure the VM is "wired" together correctly. The virtual topology should be: virtual machine interface (eth0) --> tap interface --> virtual bridge --> bond.
Ask libvirt for the tap interface, bridge, and MAC of the instance:
$ sudo virsh dumpxml instance-00000347 | grep -A 5 interface
<interface type='bridge'>
  <mac address='fa:16:3e:56:ea:fa'/>
  <source bridge='brq5dddda71-76'/>
  <target dev='tap0b5f208b-76'/>
  <model type='virtio'/>
  <alias name='net0'/>
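To avoid reading the XML by eye, the three values can be pulled out with a small filter. This is only a minimal sketch: `parse_iface` is a hypothetical helper name, and it assumes the single-quoted attribute layout that virsh printed here. (`virsh domiflist instance-00000347` reports the same fields in table form.)

```shell
# Read `virsh dumpxml` output on stdin and print the MAC, bridge, and tap
# device of each bridged interface as key=value lines.
# Assumes attributes are single-quoted, as in the output above.
parse_iface() {
    sed -n \
        -e "s/.*<mac address='\([^']*\)'.*/mac=\1/p" \
        -e "s/.*<source bridge='\([^']*\)'.*/bridge=\1/p" \
        -e "s/.*<target dev='\([^']*\)'.*/tap=\1/p"
}

# Usage:
#   sudo virsh dumpxml instance-00000347 | parse_iface
```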
With the tap interface in hand - tap0b5f208b-76 - we can check how it's connected into the virtual bridge. brctl can show which tap interfaces are connected to which bridges.
$ brctl show brq5dddda71-76
bridge name       bridge id           STP enabled   interfaces
brq5dddda71-76    8000.90b11c4fcb86   no            bond0
                                                    tap0b5f208b-76
                                                    tap0e6de0ae-f1
                                                    tap81dd5a89-74
                                                    tapb207946d-0f
                                                    tapb394540f-76
                                                    tapdb1ab4ac-ac
                                                    tapf4cd7b46-a7
This is good; the tap interface is associated with the bridge, as expected. We can also see that the bridge has an interface onto the bond; again, good.
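The same membership check can be scripted without parsing brctl output, because the kernel lists each bridge's ports under /sys/class/net/<bridge>/brif/. A minimal sketch; `is_bridge_port` and the `SYSFS_NET` override are names invented here for illustration:

```shell
# Succeeds if interface $2 is attached as a port of bridge $1.
# SYSFS_NET is a hypothetical override that exists only so the check can be
# pointed at a scratch directory; by default it reads the real sysfs tree.
is_bridge_port() {
    [ -e "${SYSFS_NET:-/sys/class/net}/$1/brif/$2" ]
}

# Usage:
#   is_bridge_port brq5dddda71-76 tap0b5f208b-76 && echo attached
```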
Since the virtual cabling appears to be correctly in place, the next task is straightforward: the DHCP replies were arriving at the bond0 interface, so check each subsequent interface to see at which "hop" the reply is dropped.
tcpdump against the bridge - brq5dddda71-76 - showed the DHCP replies were arriving. However, the same tcpdump against the tap interface - tap0b5f208b-76 - did not show the DHCP replies. For some reason, the frames weren't being transmitted from my bridge to the virtual machine to which they were addressed.
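That hop-by-hop check can be sketched as a loop. This is an illustration, not the exact commands used at the time: `saw_reply` and `check_hops` are hypothetical helpers, the `TCPDUMP` override exists only so the functions can be exercised without a live capture, and the filter assumes the server unicasts its replies to the VM's MAC. It needs root and the coreutils `timeout` command.

```shell
# Wait up to 15s for one DHCP reply (UDP source port 67) addressed to MAC $2
# on interface $1. `tcpdump -c 1` exits 0 after the first matching frame;
# `timeout` returns non-zero if nothing matched in time.
saw_reply() {
    timeout 15 "${TCPDUMP:-tcpdump}" -n -e -c 1 -i "$1" \
        "ether dst $2 and udp src port 67" >/dev/null 2>&1
}

# Check each hop in turn, from the bond towards the VM, and report the result.
check_hops() {
    mac=$1; shift
    for hop in "$@"; do
        if saw_reply "$hop" "$mac"; then
            echo "$hop: reply seen"
        else
            echo "$hop: no reply"
        fi
    done
}

# Usage (as root, while the VM keeps retrying DHCP):
#   check_hops fa:16:3e:56:ea:fa bond0 brq5dddda71-76 tap0b5f208b-76
```

The hops are checked one after another, so the VM has to keep sending requests for the whole run. The first hop that reports "no reply" after an upstream hop reported "reply seen" is where the frames are being lost.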
The MAC table on the bridge shows the first sign of a problem. The Linux bridge, like a normal switch, keeps a MAC table which maps an observed source MAC address to a port. When a frame arrives, the bridge records its source MAC and creates an entry in the table so that it knows to forward future frames destined for that MAC down said port.
When libvirt connects a virtual machine to a bridge, it creates the MAC address for the bridge side of the "cable" (the tap interface) by taking the VM's MAC address and setting the first byte to 0xFE. Accordingly, a healthy MAC table entry for my problematic virtual machine should look like this:
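As a quick illustration of that rule, the expected bridge-side MAC can be derived from the guest MAC with plain parameter expansion. `tap_mac` is a hypothetical helper name used only for this sketch:

```shell
# Print the MAC expected on the bridge side for guest MAC $1:
# the same address with the first octet replaced by fe.
tap_mac() {
    printf 'fe:%s\n' "${1#*:}"
}

tap_mac fa:16:3e:56:ea:fa   # prints fe:16:3e:56:ea:fa
```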
$ brctl showmacs brq5dddda71-76 | grep ea:fa
port no   mac addr            is local?   ageing timer
  6       fe:16:3e:56:ea:fa   yes            0.00
  6       fa:16:3e:56:ea:fa   no             8.17
The key here is that the MAC for the VM (fa:16:3e:56:ea:fa) and the MAC that libvirt assigned to the bridge end for that VM (fe:16:3e:56:ea:fa) are both on the same port (6). With a healthy MAC table like this, the bridge sends traffic destined for the VM down the "virtual cable" connected to the VM. This is how it should be.
However, here is what I observed:
$ brctl showmacs brq5dddda71-76 | grep ea:fa
port no   mac addr            is local?   ageing timer
  6       fe:16:3e:56:ea:fa   yes            0.00
  1       fa:16:3e:56:ea:fa   no             1.35
For some reason, the MAC table learned that my virtual machine's MAC was on a port that the virtual machine wasn't connected to. This explains why the DHCP replies which I observed making it to the bridge interface never made it to the tap interface; the MAC table on the bridge believes that traffic for my MAC should go "somewhere else", in this case port 1. Port 1 on this bridge is associated with the upstream network. It connects the bridge into the bond. It is not where traffic for my VM should be directed.
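The comparison between the two entries can be automated. A minimal sketch, assuming the `brctl showmacs` column order shown above (port number first, MAC second) and lowercase MACs; `check_fdb` is a hypothetical helper name:

```shell
# Read `brctl showmacs <bridge>` output on stdin and compare the port learned
# for guest MAC $1 with the port of its fe: counterpart on the bridge side.
# Exit status: 0 = same port, 1 = different ports, 2 = an entry is missing.
check_fdb() {
    awk -v vm="$1" -v tap="fe:${1#*:}" '
        $2 == vm  { vm_port = $1 }
        $2 == tap { tap_port = $1 }
        END {
            if (vm_port == "" || tap_port == "") {
                print "missing entry"
                exit 2
            }
            if (vm_port == tap_port) {
                print "ok: both on port " vm_port
                exit 0
            }
            print "mismatch: VM MAC on port " vm_port ", tap on port " tap_port
            exit 1
        }'
}

# Usage:
#   brctl showmacs brq5dddda71-76 | check_fdb fa:16:3e:56:ea:fa
```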
The most immediate explanation for this behavior is a MAC conflict; if another device somewhere on the same L2 segment has the same MAC as my VM, it's possible for the switching infrastructure, including my virtual bridge, to build MAC table entries which point towards it rather than my VM.
UPDATE:
System messages indicated the same problem: MAC addresses being observed where they shouldn't be.
brq5dddda71-76: received packet on bond0.2222 with own address as source address
That is to say, the bridge was receiving frames on its bond port whose source MAC address it knew to be local, yet they were arriving from an external source. In the end, we found a switch upstream with a mis-cabled LAG that was causing an L2 loop.
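To see which bridge ports are logging this warning, and how often, the kernel log can be summarised. A sketch assuming the message format shown above; `own_addr_ports` is a hypothetical helper name:

```shell
# Read kernel log lines on stdin and count "own address as source address"
# warnings per receiving interface.
own_addr_ports() {
    sed -n 's/.*received packet on \([^ ]*\) with own address as source address.*/\1/p' |
        sort | uniq -c
}

# Usage:
#   dmesg | own_addr_ports
```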