How we got a little extra oomph from our firewall cluster, and why this trick no longer works for us

Intro
I was a running some Checkpoint firewalls in a cluster. In fact it’s been that way for years and years. At some point you get comfortable and forget to challenge and understand how it was set up. In this case re-examining the setup rewarded us with temporary survival as we were able to offload the primary member. Read on for the details…

The details
This firewall cluster included an active/standby pairing – a Nokia cluster with no state sync. The active firewall, an older model, was often hitting 99% or even 100% cpu usage on a daily basis. Dropped packets were correlated with these cpu spikes, and time-sensitive protocols, especially SIP used by IP phones, suffered mightily. Call quality often degraded, or the call was altogether dropped.

Some other relevant facts in this case: these firewalls were not doing NAT, they were acting more like routers with a firewall function. There are a handful of key servers behind them, like a VPN concentrator, a proxy, a Juniper ISG VPN concentrator, etc. On the external side was an Internet router, also under our control.

So the breakthrough was in revisiting what makes them active/passive in the first place. We weren’t relying on Checkpoint clustering. We used VRRP, defined through a Voyager setup. Then we set up our routing on all protected devices to use these VRRP IPs for their default routes. It all worked great until more and more usage crept in and then complaints started rolling in.

Upgrading costs $$ and the procurement cycle takes some time. What to do immediately, if anything?

The loudest complaints were from users of the Juniper ISG SSL VPN concentrators, who ran VOIP over those connections. What I realized (which of course is obvious in hindsight), is that this device could have all its traffic routed to the standby firewall where there was no cpu load whatsoever, and leave everything else on the active firewall.

How we did it
This was accomplished by adjusting the default route of the ISG to use the physical IP of the standby firewall, as opposed to the VRRP IP. Then, to avoid asymmetric routing, a host route was defined on the Internet router for this ISG, using as gateway the physical external IP of the standby firewall (again as opposed to the external VRRP IP.)

How it worked
It worked like a charm. We were well below our Internet link capacity, after all. So the master firewall was really the chokepoint for this voice traffic. Once we got it onto this unused firewall all the complaints stopped.

This is of course just a stop-gap measure because of course now we have no redundancy if we lose one of the firewalls. But meanwhile we’ve bought some time and kept the work-from-home users running smoothly. The master firewall still hits 99% cpu, but not quite as frequently. It’s difficult to find a true root cause, but an upgrade is definitely in order. Acceleration is already in place.

Why it won’t work for you – Checkpoint Cluster
Fast-forward five years and I tried this same trick which has served me well over the years. No worky. Why? Well these days we’ve switched to use of a Checkpoint Cluster with SYNC. In a Checkpoint cluster the secondary firewall will not forward traffic. In fact a firewall guy was the first to inform me of that. I didn’t believe him so I went ahead and configured it anyway. Sure enough, it simply didn’t work.

So for us, this trick has played itself out. But we used it multiple times during the five years it was available to us.

Conclusion
By re-visiting some old design principles were we able to get a little more mileage out of our firewalls and buy ourselves some time until we can do a planned upgrade.

Leave a Reply Cancel reply