Categories
Admin IT Operational Excellence Network Technologies

The IT Detective Agency: ARP Entry OK, PING not Working

Intro
Yes, the It detective agency is back by popular demand. This time we’ve got ourselves a thriller involving a piece of equipment – a wireless LAN controller, WLAN – on a directly connected network. From the router we could see the arp entry for the WLAN, but we could not PING it. Why?

A trace, or more correctly the output of tcpdump run on the router interface connected to that network, showed this:

>
12:08:59.623509  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:08:59.623530  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:01.272922  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:03.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:05.271469  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:07.271885  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.271804  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.622902  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:09.622922  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:11.271567  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:13.271716  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:15.271971  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:17.040748  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:17.271663  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.271832  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.392578  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:19.623515  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:19.623535  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:20.478397  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:21.271714  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:23.271697  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:25.271664  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:27.272156  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.271730  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.621882  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:29.621903  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:31.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:33.271858  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43

What’s interesting is what isn’t present. No PINGs. No unicast traffic whatsoever, yet we knew the WLAN was generating traffic. The frequent arp requests for the same IP strongly hinted that the WLAN was not getting the response. We were not able to check the arp table of the WLAN. And we knew the WLAN was supposed to respond to our PINGs, but it wasn’t. Yet the router’s arp table had the correct entry for the WLAN, so we knew it was plugged into the right switch port and on the right vlan. We also triple-checked that the network masks matched on both devices. Let’s go back. Was it really on the right vlan??

The Solution
What we eventually realized is that in the WLAN GUI, VLANs were assigned to the various interfaces. the switch port, on a Cisco switch, was a regular access port. We reasoned (documentation was scarce) that the interface was vlan tagging its traffic. So we tried to change the access port to a trunk port and enter the correct vlan. Here’s the show conf snippet:

interface GigabitEthernet1/17
 description 5508-wlan
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 887
 switchport mode trunk
 spanning-tree portfast edge trunk

Bingo! With that in place we could ping the WLAN and it could send us its traffic.

Case closed.

2018 update
I had totally forgotten my own posting. And I’ll be damned if in the heat of connecting a new firewall to a switch port we didn’t have this weird situation where we could see MAC entries of the firewall, and it could see MACs of other devices on that vlan, but nobody could ping the firewall and vica versa. A trace from tcpdump looked roughly similar to the above – a lot of arp who-has firewall, tell server. Sure enough, the firewall guy, new to the group, had configured all his ports to be tagged ports, even those with a single vlan. It had been our custom to make single vlans non-tagged ports. I didn’t start it, that’s just how it was. More than an hour was lost debugging…

And earlier in the year was yet another similar incident, where a router operated by a vendor joining one of our vlans assumed tagged ports where we did not. More than an hour was lost debugging… See a pattern there?

I had forgotten my own post from seven years ago to such an extent, I was just about to write a new one when I thought, Maybe I’ve covered that before. So old topics are new once again… Here’s to remember this for the next time!

Where to watch out for this
When you don’t run all the equipment. If you ran it all you’d have the presence of mind to make all the ports consistent.

Some terminology
A tagged port can also be known as using 802.1q, which is also known as dot1q, which in Cisco world is known as a trunk port. In the absence of that, you would have an access port (Cisco terminology) or untagged port (everyone else).

Conclusion
OK, there are probably many reasons and scenarios in which devices on the same network can see each other’s arp entries, but not send unicast traffic. But, the scenario we have laid out above definitely produces that effect, so keep it in mind as a possibility should you ever encounter this issue.

Categories
IT Operational Excellence Network Technologies

Swapping Servers while Preserving IPs – What Might Go Wrong

The Setup

I had this experience last week. I needed to swap a virtual server in place of a physical server. I had all the access I needed on both servers to do the necessary network changes, which is how I customarily do these things. I implement network configuration changes as opposed to, say, plugging cables in and pulling others out.

The Issue

Anyways, this sounded straightforward enough.  The physical server had  a backup interface on a different segment – one that I could access from the backup interface of another server also on that backup segment (so that I wouldn’t disconnect myself as I was shutting down the primary interface).  So as I was saying, simple: shutdown the primary interface on the physical server, configure the VM’s intereface similar to the physical server, reboot the virtual server so the interface changes take effect and can be seen to be working even after a reboot.  But it didn’t work, or more precisely, it half-worked.  Why?

A Trail of Hints

Here’s what I didn’t yet say that turns out that has a significant role though I did not know it at the time.  See, that interface had two IPs defined, a primary and a virtual, I’ll call it secondary since virtual is a loaded term, IP, both on the same segment, i.e., eth0 and eth0:ns2v.  After the switch eth0 was working OK, but eth0:ns2v was not!  I also need to mention how they are used, from a network perspective, to see if you are following the hints and can guess what the problem might be before I spell it out.  I have different DNS servers bound to these interfaces.  They are resolving DNS servers.  It actually does not matter (another hint!) but the OS is SLES 11.

Final hint: eth0 probably took a few minutes to work, eth0:ns2v was not working even after 17 minutes.  By not working I mean that I could see the interface on the VM come up OK, my DNS server was bound to it and I could send it queries from the VM itself.  But queries from servers on other segments to this secondary were not being returned.  I tried a trace on the VM: tcpdump -i eth0:ns2v  (OK. I lied.  More hints on the way.  This is how you solve such problems!), while doing a PING from a server on another segment. Nothing coming in.  PINGs and DNS queries to primary interface were coming in fine however.  So I know I had my routing correct.

Biggest hint of all: I could PING this secondary interface from another server on the same segment!

So what the heck is going on here?  And it’s late at night of course so no one is disturbed by this change.  I begin to suspect the router.  After all, everything else is good, right?

Do I bother the network guy to fix his router?  For me that’s akin to admitting failture to plan.  So, no, I don’t want to.  That secondary interface isn’t that important.  it could wait until morning.  But it nags at me…

First Inkling

Then it hits me.  The Aha moment.  Let me back up.  Like I said I become convinced that the router is simply wrong.  But it’s one device I do not have any administrative access to.  What do I mean by “wrong” from a network engineer’s point-of-view?  I became convinced that its ARP table hadn’t aged out its entry for the secondary IP as it ought to have.  All hosts maintain an ARP table which stores the correspondence between IP address and MAC address of other devices on the same segment.  It’s how a device “knows” to talk to the right device when an application specifies an IP address – by correctly converting it into a MAC address.  So, you see, I preserved IPs.  But what if  the router held onto the old MAC address for the secondary IP?  It would try to send traffic destined for that IP which came in from other segments to the wrong place, or no place at all, since the old MAC was now offline.  I’m not exactly sure what happens to those packets.  I’d have to investigate and think about it some more.  Could be they get sent out via the switch but dropped by the switch as there’s no place to deliver them.

But the one IP, the main one, was working.  If you can’t solve what’s wrong, it’s a good idea to review what’s gone right amongst the things which are closely related.  And try to understand the difference in the two cases.

Aha Moment

That was the real Aha moment.  A server is always doing a bit of communication.  This and that chatter.  I realized the router was seeing some of that, and that it was all coming from the main IP.  Why? Because that’s just how things work in IPv4.  Usually.  So it made some sense that the router would update the ARP entry of the main IP.  After all it was seeing these packets come to it which contained the new MAC address/old IP address.  So it probably “knew” to update its ARP table with the new MAC from those packets.  But it wasn’t getting any packets that contained the new MAC address/old secondary IP address combination!  Knowing this situation, you would hope, as a reasonable person, that there would be an ARP table timer on all the ARP entries and they would simply age out and be renewed from time-to-time to prevent just such a situation.  You would hope, but it wasn’t happening.  17 minutes is a long time.  I later learned that this was an old router.  Supposedly it has an ARP timer of five or ten minutes.  But I know that isn’t correct. 

But I did not know any of that at the time.  I knew the main interface worked, the secondary didn’t.   Packets were streaming out of the primary to the router, no packets were streaming from the secondary to the router.  So I asked myself: how can I send packets from the secondary interface??  How do you do that?

I suspected two ways offhand.  I’m sure there are lots of others.  I suspected PING could do it.  Check the man pages.  Yup. ping -I interface_address.  Bingo.  I decided to ping the router, which is, of course, my default gateway, with the secondary IP as source.  The packets were returned.  Good.  Then I noticed that my monitors were completing.  I checked seconds later.  Sure enough, I could now reach that secondary IP from other segments.  Yeah!  Problem resolved.

Mystery solved, and no cold call to the networking group required.

Tying Up the Loose Ends

What would I have asked for if I had called the networking group?  I would have told them I suspect the ARP table on their router was not updating and could they please delete the ARP entry for that secondary IP, that’s what.  That’s what I would have done right away myself if I had had that kind of access.  On *ix devices there is usually a command like arp -d ip_address to delete a specific ARP entry. 

This also explains why another device on that segment could see that secondary IP while at the same time the router couldn’t.  The other device obviously had a more well-behaved ARP time-out mechanism.  Or perhaps it  didn’t but it had had no ARP entry for that secondary IP until after the server switch?  And of course the way modern switches work the traffic is all directed and carved up.  So the communication between those two devices, which would have contained nice and uptodate MAC/IP entries was completely segregated and none of it would have been seen by the router, so in that sense wasn’t helping any.  And what was the other way to send packets from a specific IP?  dig.  I use dig constantly in my capacity as a DNS admin, so I was aware it also allows you to specify your source IP address (dig -b).  Another way that most people would have thought of?  nmap.  I haven’t really checked, but I’m willing to bet nmap could easily also have been used.  But that’s kinf of a “nasty” utility and actually isn’t normally available on self-respecting servers.  It certainly wasn’t on this one.  sendmail MTA could also be used for this same purpose (setting the source IP), but that’s a pain in the rear to set up.  As I say there are probably lots of other utilities that do this.  nc or netcat, depending on your version of Linux, may also be promising.  The aspiring programmers could write a simple PERL (or pick your language) client to do the same thing, etc.   I now see that even telnet allows you to specify your source IP with the -b switch.  So it seems to be a fairly common feature – though not universal, just try to find it on an FTP client – in most networking utilities.

An IT person benefits from having lots of tools which accomplish the same things in different ways.

More Details As Time Permits