Intro
I don’t think we ask a lot of our switches. A packet comes in on this port and goes out on this other port. That sort of thing. But, apparently, when the switch is a Cisco Nexus, that is asking too much. Maybe it will go to that other port. Let’s see, it depends who sent it. So then again, maybe not. So maybe device A can ping device B. Device B cannot ping device A. But it can ping device C and device C can ping A. Or maybe Jack can ping device 1 but not device 2 on the same subnet. Yet Jill can ping device 2 but not device 1!
Yes. Those are not theoretical examples, but, sadly, actual examples lifted from recent massive debugging sessions I’ve participated in, which, eventually, focused on the Nexus switch not treating all traffic as it should. Another example: larger packets were not getting through on the same vlan.
Sometimes the Cisco support engineer is too clever by half. We had one find a few errors on supervisor board 5 so we switched to board 6. That did nothing. We finally noticed that all problems were related to FEXes connected to slot 8. As a test we plugged a FEX into a different slot and, voila, things began to work better. But the Cisco guy did not see any errors on slot 8.
Well, you only get the same Cisco engineer for a limited time. Then their shift turns over and you get to explain the whole problem all over again to the next one. Yeah, supposedly they’re briefed, but ni reality not so much. So finally the second engineer was a little more pragmatic than the first too-clever guy. This is a quote from her: “I do not see errors on that module, but logically, it has to be the problem.” This of course is after we presented all the evidence to her. So we ordered an RMA, replaced that board and yes everything began to work.
I’ve been involved in two such massive debugging session over the last four months. No one from the Cisco side is saying it, but I will. These Nexus switches define flows behind the scenes. They can probably do some very clever things with these flows. But it is no longer a simple switch. And, sadly, when the flows don’t all work, no one at Cisco is prepared to look for that issue or even consider it as a possibility. They look for obscure errors. And, heaven forbid, they are, I guess, incapable of proving to themselves their switch as at fault – eating packets – by doing what any first semester network class would have, namely, mirroring some of the ports and examining the traffic for themselves. It is instead up to the customer to prove the switch is at fault.
So in my first debugging session, which lasted about 16 hours, Cisco did not really find the errors that would lead them to believe slot 8 was a problem. Yet its replacement fixed everything. In fairness, in the second debugging session, which only lasted five hours or so, they did see errors which justified a replacement. But at no time did they ever use the word flow or offer in detail how the errors they saw could have produced the weird results we were seeing (that was the Jack and Jill example from above).
Conclusion
We guys who run servers would like to think a switch is a switch is a switch. But Cisco is requiring us to up our game, even if they don’t up theirs. Their Nexus switches absolutely can eat some of your network traffic while passing other over the same ports. I’d like to call that flows, even if they refuse to use that term.