This blog post describes yet another bizarre example of how reliable digital twins are, but don’t worry; they all work great in PowerPoint.
After “fixing” the integration tests to deal with ArubaCX’s notion of VXLAN VNI having 16 bits, the bridging test worked, but the IRB tests kept failing.
In the IRB test, the lab has two layer-3 switches. Each of them should be able to bridge within a VLAN/VXLAN segment and route across the segments.
For example, H1 should be able to ping H2 (bridging), as well as H3 and H4 (routing). The way netlab sets up the lab, all hosts use S1 (ArubaCX) as the default gateway.

Lab diagram
This is the lab topology I was using (it has been adjusted in the meantime to deal with ArubaCX):
module: [ vlan, vxlan, ospf ]

groups:
  _auto_create: True
  hosts:
    members: [ h1, h2, h3, h4 ]
    device: linux
    provider: clab

nodes:
  s1:
    device: arubacx
  s2:
    device: frr
    provider: clab

vlans:
  red:
    vni: 5000
    ospf.passive: True
    links: [ s1-h1, s2-h2 ]
  blue:
    vni: 5001
    ospf.passive: True
    links: [ s1-h3, s2-h4 ]

links:
- interfaces: [ s1, s2 ]
  mtu: 1600

tools:
  edgeshark:
The test results were disheartening:
- The lab worked with two ArubaCX virtual machines.
- It failed when S2 was replaced with an FRR container.
- It worked (again) when S2 was an Arista vEOS virtual machine, but not when it was an Arista cEOS container.
- Everything started working after I rebooted my netlab server.
Fortunately, if the previous troubleshooting exercise taught me anything, it was to do a packet capture before wasting time on anything else. netlab includes support for the excellent Edgeshark tool, so I was able to perform packet capture on my macOS laptop from my Ubuntu server, which was ~50 km away (yay, Tailscale!).
Here’s the VXLAN-encapsulated ARP request (captured on the link between S1 and S2) sent from H1 before it tries to ping H2:

And here’s the VXLAN-encapsulated ARP request for H3 sent by S1 (ArubaCX) when it tries to route the packet from H1 to H3:

Can you spot the difference? Although the two packets have the same size (Ethernet frames are 92 bytes long), the IP packet length and the UDP payload length don’t match. ArubaCX claims the packet contains four bytes more than it does.
Long story short: VXLAN packets generated by the ArubaCX routing process have an invalid length. The most likely culprit is the VLAN tag attached to the packet before it enters the (software) VXLAN encapsulation process. The VLAN tag is removed (correctly), but the packet length is not adjusted.
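To make the mismatch concrete, here is a minimal Python sketch (not part of the original troubleshooting) that compares the IPv4 total length and UDP length fields with the number of bytes actually present in a captured frame. It assumes an untagged outer Ethernet header, IPv4 transport, and a capture without the FCS. `length_mismatch` and `fake_vxlan_frame` are hypothetical helper names, and the frame is synthetic (zeroed addresses and inner payload), sized to match the 92-byte frames mentioned above.

```python
import struct

ETH_HLEN = 14  # assumption: untagged outer Ethernet header, no FCS in the capture


def length_mismatch(frame: bytes) -> dict:
    """Return how many bytes the IPv4 and UDP headers claim beyond what was captured."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:
        raise ValueError("not an untagged IPv4 frame")
    ihl = (frame[ETH_HLEN] & 0x0F) * 4  # IPv4 header length in bytes
    ip_total = struct.unpack("!H", frame[ETH_HLEN + 2:ETH_HLEN + 4])[0]
    if frame[ETH_HLEN + 9] != 17:
        raise ValueError("not a UDP packet")
    udp_off = ETH_HLEN + ihl
    udp_len = struct.unpack("!H", frame[udp_off + 4:udp_off + 6])[0]
    return {
        "ip_excess": ip_total - (len(frame) - ETH_HLEN),
        "udp_excess": udp_len - (len(frame) - udp_off),
    }


def fake_vxlan_frame(claimed_extra: int = 0) -> bytes:
    """Build a synthetic 92-byte VXLAN frame; claimed_extra inflates both length fields."""
    inner = bytes(42)  # placeholder for a 14-byte Ethernet header + 28-byte ARP
    vxlan = struct.pack("!II", 0x08000000, 5000 << 8)  # I flag set, VNI 5000
    udp_len = 8 + len(vxlan) + len(inner)
    udp = struct.pack("!HHHH", 49152, 4789, udp_len + claimed_extra, 0)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + udp_len + claimed_extra,  # version/IHL, TOS, total length
        0, 0, 64, 17, 0,                        # ID, flags/fragment, TTL, UDP, checksum
        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )
    eth = bytes(12) + struct.pack("!H", 0x0800)
    return eth + ip + udp + vxlan + inner


if __name__ == "__main__":
    print(length_mismatch(fake_vxlan_frame()))   # {'ip_excess': 0, 'udp_excess': 0}
    print(length_mismatch(fake_vxlan_frame(4)))  # {'ip_excess': 4, 'udp_excess': 4}
```

Under these assumptions, a 92-byte frame should carry an IP total length of 78 and a UDP length of 58; a sender that removes a 4-byte VLAN tag without adjusting the length would claim 82 and 62 instead.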
Now for the fun part: why did the test sometimes work? It’s evident that most VXLAN implementations don’t verify the IP or UDP packet length (a bad idea), or the test could never work with devices from other vendors. But why did Arista vEOS accept the packet when it never reached Arista cEOS?
Welcome to the wonderful world of Linux bridges that love to masquerade as firewalls and sometimes decide it’s their job to filter invalid IP traffic. Fortunately, we can do a packet capture on interfaces connected to a Linux bridge to verify who the culprit is, and here’s what was going on in my server:
- netlab uses libvirt UDP tunnels to create point-to-point inter-VM links. The ArubaCX VM and the Arista vEOS VM are thus connected with a UDP tunnel, and the Linux bridge is not involved.
- netlab has to use a Linux bridge to connect the ArubaCX VM and an Arista cEOS (or FRR) container, and the Linux bridge dropped the packets with an invalid IP length.
- Apparently, the FRR VM (Debian Bookworm) checks the IP packet length, whereas the FRR container doesn’t.
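The observations in the list above boil down to a two-step check, sketched here as a toy Python model. It is only a restatement of the list, not netlab code; `arp_delivered` and its parameters are hypothetical names.

```python
def arp_delivered(
    peer_is_vm: bool,
    bridge_filters_invalid_ip: bool = True,
    peer_checks_length: bool = False,
) -> bool:
    """Toy model: does the ARP request with the invalid length reach the peer's VXLAN code?"""
    # VM-to-VM links use libvirt UDP tunnels; links to containers need a Linux bridge.
    uses_linux_bridge = not peer_is_vm
    if uses_linux_bridge and bridge_filters_invalid_ip:
        return False  # dropped by the bridge before it reaches the container
    # The packet arrives; the peer accepts it unless it validates the IP length itself.
    return not peer_checks_length


if __name__ == "__main__":
    print("vEOS VM:      ", arp_delivered(peer_is_vm=True))
    print("cEOS container:", arp_delivered(peer_is_vm=False))
    print("FRR VM:       ", arp_delivered(peer_is_vm=True, peer_checks_length=True))
```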
Finally, why did things start to work after I rebooted the server? The firewall-on-a-bridge is an add-on module that’s not loaded at boot time, so the test works after a server reboot. However, something (and I wasn’t able to figure out what) triggers the loading of that kernel module, and from that point onwards, the ArubaCX VM cannot send VXLAN-encapsulated ARP requests to adjacent containers.
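A quick way to see which state the server is in is to check whether the bridge-filtering module is currently loaded. The post does not name the module; the sketch below assumes it is `br_netfilter`, the usual Linux bridge netfilter module, and parses the text of `/proc/modules`. The helper names are hypothetical.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Extract module names (first column) from the contents of /proc/modules."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def bridge_filter_loaded(proc_modules_text: str, module: str = "br_netfilter") -> bool:
    """Assumption: the firewall-on-a-bridge module is br_netfilter."""
    return module in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():  # only present on Linux
        print("br_netfilter loaded:", bridge_filter_loaded(proc_modules.read_text()))
```

Running it right after a reboot and again once the tests start failing would show whether the module load coincides with the change in behavior.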
Maybe we should rename the Linux bridge to a Heisenberg bridge? If it works, you don’t know why, and if you think you know how it’s configured, it works in unpredictable ways.