Categories
Admin IT Operational Excellence Network Technologies

The IT Detective Agency: ARP Entry OK, PING not Working

Intro
Yes, the IT Detective Agency is back by popular demand. This time we’ve got ourselves a thriller involving a piece of equipment – a wireless LAN controller, WLAN – on a directly connected network. From the router we could see the arp entry for the WLAN, but we could not PING it. Why?

A trace, or more correctly the output of tcpdump run on the router interface connected to that network, showed this:

12:08:59.623509  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:08:59.623530  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:01.272922  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:03.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:05.271469  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:07.271885  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.271804  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.622902  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:09.622922  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:11.271567  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:13.271716  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:15.271971  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:17.040748  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:17.271663  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.271832  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.392578  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:19.623515  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:19.623535  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:20.478397  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:21.271714  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:23.271697  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:25.271664  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:27.272156  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.271730  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.621882  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:29.621903  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:31.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:33.271858  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43

What’s interesting is what isn’t present. No PINGs. No unicast traffic whatsoever, yet we knew the WLAN was generating traffic. The frequent arp requests for the same IP strongly hinted that the WLAN was not getting the response. We were not able to check the arp table of the WLAN. And we knew the WLAN was supposed to respond to our PINGs, but it wasn’t. Yet the router’s arp table had the correct entry for the WLAN, so we knew it was plugged into the right switch port and on the right vlan. We also triple-checked that the network masks matched on both devices. Let’s go back. Was it really on the right vlan??

The Solution
What we eventually realized is that in the WLAN GUI, VLANs were assigned to the various interfaces. The switch port, on a Cisco switch, was a regular access port. We reasoned (documentation was scarce) that the interface was vlan tagging its traffic. So we tried to change the access port to a trunk port and enter the correct vlan. Here’s the show conf snippet:

interface GigabitEthernet1/17
 description 5508-wlan
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 887
 switchport mode trunk
 spanning-tree portfast edge trunk

Bingo! With that in place we could ping the WLAN and it could send us its traffic.

Case closed.

2018 update
I had totally forgotten my own posting. And I’ll be damned if in the heat of connecting a new firewall to a switch port we didn’t have this weird situation where we could see MAC entries of the firewall, and it could see MACs of other devices on that vlan, but nobody could ping the firewall and vice versa. A trace from tcpdump looked roughly similar to the above – a lot of arp who-has firewall, tell server. Sure enough, the firewall guy, new to the group, had configured all his ports to be tagged ports, even those with a single vlan. It had been our custom to make single vlans non-tagged ports. I didn’t start it, that’s just how it was. More than an hour was lost debugging…

And earlier in the year was yet another similar incident, where a router operated by a vendor joining one of our vlans assumed tagged ports where we did not. More than an hour was lost debugging… See a pattern there?

I had forgotten my own post from seven years ago to such an extent, I was just about to write a new one when I thought, Maybe I’ve covered that before. So old topics are new once again… Here’s to remembering this for the next time!

Where to watch out for this
When you don’t run all the equipment. If you ran it all you’d have the presence of mind to make all the ports consistent.

Some terminology
A tagged port can also be known as using 802.1q, which is also known as dot1q, which in Cisco world is known as a trunk port. In the absence of that, you would have an access port (Cisco terminology) or untagged port (everyone else).
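
To make the tagged versus untagged distinction concrete, here is a minimal Python sketch of what an 802.1Q tag adds to an Ethernet frame. The helper names build_frame and vlan_of and the MAC addresses are made up for illustration; VLAN 887 is the one from the trunk configuration above. A device expecting untagged frames finds 0x8100 where it expects the EtherType, which is one plausible way the two sides end up unable to exchange unicast traffic.

```python
import struct

TPID_8021Q = 0x8100      # value in the EtherType position that marks an 802.1Q tag
ETHERTYPE_IPV4 = 0x0800


def build_frame(dst, src, ethertype, payload, vlan_id=None):
    """Build a raw Ethernet frame; insert a 4-byte 802.1Q tag if vlan_id is given."""
    header = dst + src
    if vlan_id is not None:
        # Tag = TPID (16 bits) + TCI (priority and DEI left at 0, 12-bit VLAN ID)
        header += struct.pack("!HH", TPID_8021Q, vlan_id & 0x0FFF)
    header += struct.pack("!H", ethertype)
    return header + payload


def vlan_of(frame):
    """Return the VLAN ID carried by the frame, or None if it is untagged."""
    (tpid,) = struct.unpack("!H", frame[12:14])
    if tpid != TPID_8021Q:
        return None
    (tci,) = struct.unpack("!H", frame[14:16])
    return tci & 0x0FFF


# Hypothetical addresses, for illustration only
dst = bytes.fromhex("020000000001")
src = bytes.fromhex("020000000002")

untagged = build_frame(dst, src, ETHERTYPE_IPV4, b"ping")
tagged = build_frame(dst, src, ETHERTYPE_IPV4, b"ping", vlan_id=887)

print(vlan_of(untagged))             # None
print(vlan_of(tagged))               # 887
print(len(tagged) - len(untagged))   # 4
```

The only difference between the two frames is the four extra bytes after the source MAC, which is why both ends have to agree on whether the port is tagged.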

Conclusion
OK, there are probably many reasons and scenarios in which devices on the same network can see each other’s arp entries, but not send unicast traffic. But, the scenario we have laid out above definitely produces that effect, so keep it in mind as a possibility should you ever encounter this issue.


Internet Service Providers Block TCP Port 22 or Do They?

Intro
The original premise of this article is that some Internet Service Providers were seen to block TCP port 22, used by ssh and sftp. However, as often happens during active IT investigations, this turns out to be completely wrong. In fact there was a block in this case we studied, but not by the ISPs. An overly aggressive ACL on the customer premises equipment Internet router is in fact the culprit.

The Problem
(IPs skewed to protect whatever) We asked a partner to do an sftp to drjohnstechtalk.com. All firewall and routing rules were in place. The partner tried it. He saw a SYN packet leaving, but no packets being returned. Here at drjohnstechtalk, we didn’t see any packets whatsoever! This partner makes sftp connections to other servers successfully. What the heck?

We had them try the following basic command:

nc -v host 22

where host is the IP of the target server. The response was:

nc: connect to host port 22 (tcp) failed: No route to host

But switching to port 21 (FTP) showed completely different behaviour: there was no message whatsoever and the session hung. That’s good! That’s the usual firewalls dropping packets. But this No route to host needs more exploration.

Getting Closer
So we did an open trace. I mean a tcpdump without any limiting expression. The dump showed the SYN out to port 22, followed by this nugget:

13:09:14.279176 IP Sprint_IP > src_IP: ICMP host target_IP unreachable - admin prohibited filter, length 36
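
The difference between the two behaviours (an immediate No route to host versus a silent hang) can be told apart programmatically. What follows is a rough Python sketch, not the tooling we used; the function names classify and probe and the labels they return are invented for illustration. It relies on the fact that on Linux an ICMP unreachable reply such as the admin prohibited one above is reported to the connecting program as "No route to host" (EHOSTUNREACH), as the nc output showed.

```python
import errno
import socket


def classify(exc):
    """Map a connect() failure to a rough description of what the network did."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "dropped"            # no answer at all: packets silently discarded
    if isinstance(exc, ConnectionRefusedError):
        return "refused"            # a TCP RST came back: host reached, port closed
    if isinstance(exc, OSError) and exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return "icmp-unreachable"   # an ICMP unreachable came back, e.g. admin prohibited
    return "other"


def probe(host, port, timeout=5.0):
    """Attempt a TCP connection and return 'open' or one of the labels above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify(exc)
```

Calling probe(host, 22) and probe(host, 21) against the target would be expected to give "icmp-unreachable" and "dropped" respectively in the situation described here. An "icmp-unreachable" result is a hint that some router along the path is actively rejecting the traffic rather than a firewall quietly discarding it.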

Next Steps
This well-intentioned filtering is causing a business problem. The Cisco IOS ACL that got them into trouble was this one:

ip access-list extended drop-spoof-and-telnet
 deny   tcp any any eq 22 log-input

Solution
They liked the idea of this filtering, but apparently this was the first request for inbound ssh access. So they decided to keep this filter rule but precede it with more specific rules as required, essentially acting like a second firewall:

ip access-list extended drop-spoof-and-telnet
 permit tcp host IP_src host IP_dest eq 22
 deny   tcp any any eq 22 log-input
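
The fix works because the ACL entries are evaluated top-down and the first match wins, so the specific permit has to come before the general deny. A small Python model of that behaviour, as a sketch only: the evaluate function, the tuple layout and the addresses (taken from the documentation ranges, standing in for IP_src and IP_dest) are all assumptions for illustration, and only the two entries shown above are modelled.

```python
def evaluate(acl, src, dst, port):
    """Return the action of the first matching entry, as a first-match ACL does."""
    for action, rule_src, rule_dst, rule_port in acl:
        if (rule_src in ("any", src)
                and rule_dst in ("any", dst)
                and rule_port == port):
            return action
    return "no match"   # entries not shown here would decide the outcome


# Placeholder addresses, standing in for IP_src and IP_dest
PARTNER = "192.0.2.10"
SERVER = "198.51.100.20"

fixed = [
    ("permit", PARTNER, SERVER, 22),
    ("deny", "any", "any", 22),
]
wrong_order = list(reversed(fixed))

print(evaluate(fixed, PARTNER, SERVER, 22))         # permit
print(evaluate(fixed, "203.0.113.5", SERVER, 22))   # deny
print(evaluate(wrong_order, PARTNER, SERVER, 22))   # deny
```

With the entries reversed, the partner is denied even though a permit for him exists further down, which is why appending the permit to the end of the list would not have helped.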

The Basics of How to Work with Cipher Settings

December 2014 update with some tips for making your server POODLE-proof, and 2016 update to deal with OpenSSL Padding Oracle Vulnerability CVE-2016-2107

Intro
We got audited. There’s always something they catch, right? But I actually appreciate the thoroughness of this audit, and I used its findings to learn a little about one of those mystery areas that never seemed to matter until now: ciphers. Now it matters because cipher weakness was the finding!

I had an older piece of Nortel gear which was running SSL. The auditors found that it allows anonymous authentication ciphers. Have you ever heard of such a thing? I hadn’t either! I am far from an expert in this area, but I will attempt an explanation of the implication of this weakness which, by the way, was scored as a “high severity” – the highest on their scale in fact!

Why Anonymous Authentication is a Severe Matter
The briefly stated reason in the finding is that it allows for a Man In the Middle (MITM) attack. I’ve given it some thought and I haven’t figured out what the core issue is. The correct behaviour is for a client to authenticate a server in an SSL session, usually using RSA. If no authentication occurs, a MITM SSL server could be inserted in between client and server, or so they say.
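
One textbook way to picture the MITM concern, offered as an illustration rather than as the auditors’ own reasoning: anonymous (ADH) suites agree on a key with Diffie-Hellman but present no certificate, so neither side can tell whose public value it received. The Python sketch below uses toy parameters (the prime, the generator and all function names are assumptions chosen for readability, not anything from a real TLS stack) to show an interceptor ending up with both session keys.

```python
import secrets

# Toy parameters for illustration only; far too small for real use
P = 2**127 - 1   # a Mersenne prime
G = 3


def keypair():
    """Return (private, public) for a Diffie-Hellman exchange."""
    private = secrets.randbelow(P - 3) + 2
    return private, pow(G, private, P)


def shared_secret(private, peer_public):
    """Combine our private value with the public value we received."""
    return pow(peer_public, private, P)


client_priv, client_pub = keypair()
server_priv, server_pub = keypair()
mitm_priv, mitm_pub = keypair()

# The interceptor replaces each side's public value with its own.
# With no certificate or signature, neither side can detect the swap.
client_key = shared_secret(client_priv, mitm_pub)
server_key = shared_secret(server_priv, mitm_pub)

# The interceptor can compute both keys, so it can decrypt and re-encrypt.
print(client_key == shared_secret(mitm_priv, client_pub))   # True
print(server_key == shared_secret(mitm_priv, server_pub))   # True
```

With an authenticated suite the server signs its side of the exchange with the key in its certificate, so the swapped value would fail verification on the client.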

Reproducing the Problem
OK, so we don’t understand the issue, but we do know enough to reproduce their results. That is helpful so we’ll know when we’ve resolved it without going back to the auditors. Our tool of choice is openssl. In theory, you can list the available ciphers in openssl thus:

openssl ciphers -v

And you’ll probably end up with an output looking like this, without the header which I’ve added for convenience:

Cipher Name|SSL Protocol|Key exchange algorithm|Authentication|Encryption algorithm|MAC digest algorithm
DHE-RSA-AES256-SHA      SSLv3 Kx=DH       Au=RSA  Enc=AES(256)  Mac=SHA1
DHE-DSS-AES256-SHA      SSLv3 Kx=DH       Au=DSS  Enc=AES(256)  Mac=SHA1
AES256-SHA              SSLv3 Kx=RSA      Au=RSA  Enc=AES(256)  Mac=SHA1
KRB5-DES-CBC3-MD5       SSLv3 Kx=KRB5     Au=KRB5 Enc=3DES(168) Mac=MD5
KRB5-DES-CBC3-SHA       SSLv3 Kx=KRB5     Au=KRB5 Enc=3DES(168) Mac=SHA1
EDH-RSA-DES-CBC3-SHA    SSLv3 Kx=DH       Au=RSA  Enc=3DES(168) Mac=SHA1
EDH-DSS-DES-CBC3-SHA    SSLv3 Kx=DH       Au=DSS  Enc=3DES(168) Mac=SHA1
DES-CBC3-SHA            SSLv3 Kx=RSA      Au=RSA  Enc=3DES(168) Mac=SHA1
DES-CBC3-MD5            SSLv2 Kx=RSA      Au=RSA  Enc=3DES(168) Mac=MD5
DHE-RSA-AES128-SHA      SSLv3 Kx=DH       Au=RSA  Enc=AES(128)  Mac=SHA1
DHE-DSS-AES128-SHA      SSLv3 Kx=DH       Au=DSS  Enc=AES(128)  Mac=SHA1
AES128-SHA              SSLv3 Kx=RSA      Au=RSA  Enc=AES(128)  Mac=SHA1
RC2-CBC-MD5             SSLv2 Kx=RSA      Au=RSA  Enc=RC2(128)  Mac=MD5
KRB5-RC4-MD5            SSLv3 Kx=KRB5     Au=KRB5 Enc=RC4(128)  Mac=MD5
KRB5-RC4-SHA            SSLv3 Kx=KRB5     Au=KRB5 Enc=RC4(128)  Mac=SHA1
RC4-SHA                 SSLv3 Kx=RSA      Au=RSA  Enc=RC4(128)  Mac=SHA1
RC4-MD5                 SSLv3 Kx=RSA      Au=RSA  Enc=RC4(128)  Mac=MD5
RC4-MD5                 SSLv2 Kx=RSA      Au=RSA  Enc=RC4(128)  Mac=MD5
KRB5-DES-CBC-MD5        SSLv3 Kx=KRB5     Au=KRB5 Enc=DES(56)   Mac=MD5
KRB5-DES-CBC-SHA        SSLv3 Kx=KRB5     Au=KRB5 Enc=DES(56)   Mac=SHA1
EDH-RSA-DES-CBC-SHA     SSLv3 Kx=DH       Au=RSA  Enc=DES(56)   Mac=SHA1
EDH-DSS-DES-CBC-SHA     SSLv3 Kx=DH       Au=DSS  Enc=DES(56)   Mac=SHA1
DES-CBC-SHA             SSLv3 Kx=RSA      Au=RSA  Enc=DES(56)   Mac=SHA1
DES-CBC-MD5             SSLv2 Kx=RSA      Au=RSA  Enc=DES(56)   Mac=MD5
EXP-KRB5-RC2-CBC-MD5    SSLv3 Kx=KRB5     Au=KRB5 Enc=RC2(40)   Mac=MD5  export
EXP-KRB5-DES-CBC-MD5    SSLv3 Kx=KRB5     Au=KRB5 Enc=DES(40)   Mac=MD5  export
EXP-KRB5-RC2-CBC-SHA    SSLv3 Kx=KRB5     Au=KRB5 Enc=RC2(40)   Mac=SHA1 export
EXP-KRB5-DES-CBC-SHA    SSLv3 Kx=KRB5     Au=KRB5 Enc=DES(40)   Mac=SHA1 export
EXP-EDH-RSA-DES-CBC-SHA SSLv3 Kx=DH(512)  Au=RSA  Enc=DES(40)   Mac=SHA1 export
EXP-EDH-DSS-DES-CBC-SHA SSLv3 Kx=DH(512)  Au=DSS  Enc=DES(40)   Mac=SHA1 export
EXP-DES-CBC-SHA         SSLv3 Kx=RSA(512) Au=RSA  Enc=DES(40)   Mac=SHA1 export
EXP-RC2-CBC-MD5         SSLv3 Kx=RSA(512) Au=RSA  Enc=RC2(40)   Mac=MD5  export
EXP-RC2-CBC-MD5         SSLv2 Kx=RSA(512) Au=RSA  Enc=RC2(40)   Mac=MD5  export
EXP-KRB5-RC4-MD5        SSLv3 Kx=KRB5     Au=KRB5 Enc=RC4(40)   Mac=MD5  export
EXP-KRB5-RC4-SHA        SSLv3 Kx=KRB5     Au=KRB5 Enc=RC4(40)   Mac=SHA1 export
EXP-RC4-MD5             SSLv3 Kx=RSA(512) Au=RSA  Enc=RC4(40)   Mac=MD5  export
EXP-RC4-MD5             SSLv2 Kx=RSA(512) Au=RSA  Enc=RC4(40)   Mac=MD5  export

I’m not going to explain all those headers because, umm, I don’t know myself. Perhaps in a later or updated posting. The point I want to make here is that as complete as this listing appears, it’s really incomplete. openssl actually supports additional ciphers as well, as I learned by combining information from the audit, plus Nortel’s documentation. In particular Nortel mentions additional ciphers such as these:

ADH-AES256-SHA SSLv3 DH, NONE AES (256) SHA1
ADH-DES-CBC3-SHA SSLv3 DH, NONE 3DES (168) SHA1

I singled these out because the “NONE” means anonymous authentication – the subject of the audit finding! Note that these ciphers were not present in the openssl listing. So now I know Nortel potentially supports anonymous (also called NULL) authentication. There remains the question of whether my specific implementation supports it. Of course the audit says it does, but I want to have sufficient expertise to verify for myself. So, try this:

openssl s_client -cipher ADH-DES-CBC3-SHA -connect IP_of_Nortel_server:443

I get:

---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 411 bytes and written 239 bytes
---
New, TLSv1/SSLv3, Cipher is ADH-DES-CBC3-SHA
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : ADH-DES-CBC3-SHA
    Session-ID: 30F1375839B8CFB508CDEFC9FBE4A5BF2D5CE240038DFF8CC514607789CCEDD5
    Session-ID-ctx:
    Master-Key: B2374E609874D1015DC55BEAA0289310445BAFF65956908A497E5C51DF1301D68CC47AB395DDFEB9A1C77B637A4D306F
    Key-Arg   : None
    Krb5 Principal: None
    Start Time: 1317132292
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

You see that it listed the Cipher as the one I requested, ADH-DES-CBC3-SHA. Further note that no certificate names are sent. Normally they are. To see if my method is correct, let’s try one of Google’s secure servers. Certainly Google will not permit NULL authentication if it’s a bad practice:

openssl s_client -cipher aNULL -connect 74.125.67.84:443

produces this output:

21390:error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure:s23_clnt.c:583:

Google does not permit this cipher! As a control, let’s use openssl without specifying a specific cipher against both servers. First, the Nortel server:

openssl s_client -connect IP_of_Nortel_server:443

produces some long output, which spits out the server certificates, followed by this:

New, TLSv1/SSLv3, Cipher is DHE-RSA-AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : DHE-RSA-AES256-SHA
    Session-ID: 6D1A4383F3DBF4C14007220715ECCFB83D91C524624ACE641843880291200AE2
    Session-ID-ctx:
    Master-Key: BE3FB61B169F497A922A9A172D36A4BB15C26074021D7F22D125875980070E157EDA3100572F927B427B03BF81543E1A
    Key-Arg   : None
    Krb5 Principal: None
    Start Time: 1317132982
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)

So you see client and server agreed to use the cipher DHE-RSA-AES256-SHA, which from our table uses RSA authentication. And hitting Google again without the ciphers argument we get this:

New, TLSv1/SSLv3, Cipher is RC4-SHA
Server public key is 1024 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : RC4-SHA
    Session-ID: 236FDF47DA752E768E7EE32DA10103F1CAD513E9634F075BE8773090A2E7A995
    Session-ID-ctx:
    Master-Key: 39212DE0E3A98943C441287227CB1425AE11CCA277EFF6F8AF83DA267AB256B5A8D94A6573DFD54FB1C9BF82EA302494
    Key-Arg   : None
    Krb5 Principal: None
    Start Time: 1317133483
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

So in this case it is successful, though it has chosen a different cipher from Nortel, namely RC4-SHA. But we can look it up and see that it’s a cipher which uses RSA authentication. Cool.
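
The “look it up” step can be automated. This is a small Python sketch that parses lines in the openssl ciphers -v layout and reports the authentication method for a given cipher name; the function names are made up, and the third sample line is an assumption showing how openssl itself prints an anonymous suite (Au=None), not a line from the listing above.

```python
def parse_cipher_listing(text):
    """Parse lines in `openssl ciphers -v` format into a list of dicts."""
    ciphers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue   # skip blank or malformed lines
        entry = {"name": fields[0], "protocol": fields[1]}
        for field in fields[2:]:
            if "=" in field:   # Kx=, Au=, Enc=, Mac=; a trailing "export" is ignored
                key, value = field.split("=", 1)
                entry[key] = value
        ciphers.append(entry)
    return ciphers


def auth_of(ciphers, name):
    """Return the set of authentication methods listed for a cipher name."""
    return {c.get("Au") for c in ciphers if c["name"] == name}


SAMPLE = """\
DHE-RSA-AES256-SHA      SSLv3 Kx=DH       Au=RSA  Enc=AES(256)  Mac=SHA1
RC4-SHA                 SSLv3 Kx=RSA      Au=RSA  Enc=RC4(128)  Mac=SHA1
ADH-AES256-SHA          SSLv3 Kx=DH       Au=None Enc=AES(256)  Mac=SHA1
"""

ciphers = parse_cipher_listing(SAMPLE)
print(auth_of(ciphers, "RC4-SHA"))                             # {'RSA'}
print([c["name"] for c in ciphers if c.get("Au") == "None"])   # ['ADH-AES256-SHA']
```

A name can appear more than once in the listing (RC4-MD5 is there for both SSLv2 and SSLv3), which is why auth_of returns a set rather than a single value.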

So we’ve “proven” all our assertions thus far. Now how do we fix Nortel? The Nortel GUI lists the ciphers as

ALL@STRENGTH

Pardon me? It turns out there are cipher groupings denoted by aliases, and you can combine the aliases into a cipher list.

ALL – means all cipher suites
EXPORT – includes cipher suites using 40 or 56 bit encryption
aNULL – cipher suites that do not offer authentication
eNULL – cipher suites that have no encryption whatsoever (disabled by default in Nortel)
STRENGTH – is at the end of the list and sorts the list in order of encryption algorithm key length

List operators are:
! – permanently deletes the cipher from the list.
+ – moves the cipher to the end of the list
: – separator of cipher strings

aNULL is a subset of ALL, and that’s what’s killing us. Putting all this together, the cipher I tried in place of ALL@STRENGTH is:

ALL:!EXPORT:!aNULL@STRENGTH

In this way I prevent NULL authentication and remove the weaker export ciphers. As soon as I applied this cipher list, I tested it. Yup – works. I can no longer hit it by using anonymous authentication:

openssl s_client  -cipher aNULL -connect IP_of_Nortel_server:443

produces

2465:error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure:s23_clnt.c:583:

and using cipher eNULL produces the same error. To make sure I’m sending a cipher which openssl understands, I tried a nonsense cipher as a control – one that I know does not exist:

openssl s_client  -cipher eddNULL -connect IP_of_Nortel_server:443

That gives a different error:

error setting cipher list
2482:error:1410D0B9:SSL routines:SSL_CTX_set_cipher_list:no cipher match:ssl_lib.c:1188:

providing assurance that aNULL and eNULL are cipher families understood and supported by openssl, and that I have done the hardening correctly!
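
As a recap of how the aliases and list operators combine, here is a toy Python model of cipher-list expansion. It is a simplified sketch and not how openssl is implemented: the six-entry catalogue, the alias memberships and the expand function are assumptions for illustration, and the toy parser expects every element, including @STRENGTH, to be separated by a colon.

```python
# (name, key length in bits, aliases the cipher belongs to)
CATALOGUE = [
    ("DHE-RSA-AES256-SHA", 256, {"ALL"}),
    ("ADH-AES256-SHA", 256, {"ALL", "aNULL"}),
    ("DES-CBC3-SHA", 168, {"ALL"}),
    ("ADH-DES-CBC3-SHA", 168, {"ALL", "aNULL"}),
    ("AES128-SHA", 128, {"ALL"}),
    ("EXP-RC4-MD5", 40, {"ALL", "EXPORT"}),
]


def expand(cipher_string, catalogue=CATALOGUE):
    """Expand a colon-separated cipher string into an ordered list of names."""
    bits = {name: length for name, length, _ in catalogue}
    active = []      # current cipher list, in preference order
    banned = set()   # ciphers removed with "!", which can never come back
    for token in cipher_string.split(":"):
        if not token:
            continue
        if token == "@STRENGTH":
            # Stable sort: strongest first, ties keep their existing order
            active.sort(key=lambda name: bits[name], reverse=True)
            continue
        op = token[0] if token[0] in "!+" else ""
        alias = token[len(op):]
        matches = [n for n, _, tags in catalogue if alias in tags or alias == n]
        if op == "!":
            banned.update(matches)
            active = [n for n in active if n not in matches]
        elif op == "+":
            moved = [n for n in active if n in matches]
            active = [n for n in active if n not in matches] + moved
        else:
            active += [n for n in matches if n not in active and n not in banned]
    return active


print(expand("ALL:!EXPORT:!aNULL:@STRENGTH"))
# ['DHE-RSA-AES256-SHA', 'DES-CBC3-SHA', 'AES128-SHA']
```

In this model expand("ALL:@STRENGTH") still contains the two ADH entries, which corresponds to the audit finding, and expand("ALL:!aNULL:aNULL") does not bring them back because "!" deletes permanently.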

Now you can probably count the number of people still using Nortel gear on your two hands! But this discussion, obviously, has wider applicability. In Apache/mod_ssl there is an SSLCipherSuite line where you specify a cipher list. The auditor’s recommendation is more detailed than what I tried. They suggest the list ALL:!aNULL:!ADH:!eNULL:!LOW:!EXP:RC4+RSA:+HIGH:+MEDIUM

October 2014 Update
Well, now we’ve encountered the SSLv3 vulnerability POODLE, which compels us to forcibly eliminate use of SSLv3 on all servers and clients. Let’s say we updated our clients to require use of TLS. How do we gain confidence the update worked? Set one of our servers to not use TLS! Here’s how I did that on a BigIP server:

DEFAULT:!TLSv1:@STRENGTH

I ran a quick test using openssl s_client -connect server:443 as above, and got what I was looking for:

...
SSL handshake has read 3038 bytes and written 479 bytes
---
New, TLSv1/SSLv3, Cipher is AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : SSLv3
    Cipher    : AES256-SHA
...

Note the protocol says SSLv3 and not TLS.
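
When repeating this kind of control test against several servers, the two interesting lines can be pulled out of saved s_client output instead of being read by eye. A minimal Python sketch, assuming the output has the SSL-Session layout shown above; the function name session_summary is made up for illustration.

```python
import re


def session_summary(s_client_output):
    """Extract the Protocol and Cipher values from the SSL-Session block."""
    summary = {}
    for key in ("Protocol", "Cipher"):
        # Matches lines such as "    Protocol  : SSLv3", but not "New, ..., Cipher is ..."
        match = re.search(r"^\s*%s\s*:\s*(\S+)" % key, s_client_output, re.MULTILINE)
        if match:
            summary[key] = match.group(1)
    return summary


SAMPLE = """\
New, TLSv1/SSLv3, Cipher is AES256-SHA
Server public key is 2048 bit
SSL-Session:
    Protocol  : SSLv3
    Cipher    : AES256-SHA
"""

print(session_summary(SAMPLE))   # {'Protocol': 'SSLv3', 'Cipher': 'AES256-SHA'}
```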

Turning off SSLv3 to deal with POODLE

So that is normally exactly the opposite of what you want to do to turn off SSLv3 – that was just to run a control test. Here’s what to do to turn off SSLv3 on a BigIP:

DEFAULT:!RC4:!SSLv3:@STRENGTH

OK, yes, RC4 is a discredited cipher so disable that as well. Most clients (but not all) will be able to work with a server which is set like this.


Apache and POODLE prevention
Well, I went to the Qualys site and found I was not exactly eating my own dogfood! My own server was considered vulnerable to POODLE, supported weak protocols, etc., and only scored a “C.” Determined to incorporate more modern approaches to my apache server settings and stealing from others, I improved things dramatically by throwing these additional configuration lines into my apache configuration:

(the following apache configuration lines are deprecated – see further down below)

...
# lock things down to get a better score from Qualys - DrJ 12/17/14
# 4 possible values: All, SSLv2, SSLv3, TLSv1. Allow TLS only:
        SSLProtocol all -SSLv2 -SSLv3
        SSLCipherSuite ALL:!aNULL:!eNULL:!SSLv2:!LOW:!EXP:!RC4:!MD5:@STRENGTH
...


The results after strengthening apache configuration

I now get an “A-” and am not supporting any weak ciphers! Yeah! It’s because those configuration lines mean that I explicitly don’t permit SSLv2/v3 or the weak RC4 cipher. I need to study to determine if I should support TLSv1.2 and forward secrecy to go to the best possible score – an “A.” (Months later) Well now I do get an A and I’m not exactly sure why the score improved.

BREACH prevention
After all the above measures the Digicert certificate inspector I am evaluating says my drjohnstechtalk site is vulnerable to the BREACH attack. From my reading the only practical solution, at least for my case, is to upgrade from apache 2.2 to apache 2.4. Hence the Herculean efforts to compile apache 2.4 as detailed in this blog post. My preliminary finding is that without changing the SSL configuration at all apache 2.4 does not show a vulnerability to BREACH. But upon digging further, it has to do with the absence of the use of compression in apache 2.4 and I’m not yet sure why it isn’t being used!

2016 Update for CVE-2016-2107
I was going to check to see if my current score at SSLLabs is an A-, and what I can do to boost it to an A. Well, I got an F! I guess the lesson here is to conduct periodic tests. Things change!

I saw from descriptions elsewhere that my version of openssl, openssl-1.0.1e-30.el6.11, was likely out-of-date. So I looked at my version of openssl on my CentOS server:

$ sudo rpm -qa|grep openssl

and updated it:

$ sudo yum update openssl-1.0.1e-30.el6.11

Now (11/11/16) my version is openssl-1.0.1e-48.el6_8.3.

Would this upgrade suffice without any further action?

Some background. I had compiled – with some difficulty – my own version of apache version 2.4: https://drjohnstechtalk.com/blog/2015/07/compiling-apache24-on-centos/.

I was pretty sure that my apache dynamically links to the openssl libraries by virtue of the lack of their appearance as listed compiled-in modules:

$ /usr/local/apache24/bin/httpd -l

Compiled in modules:
  core.c
  mod_so.c
  http_core.c
  prefork.c

Simply installing these new openssl libraries did not do the trick immediately. So the next step was to restart apache. Believe it or not, that did it!

Going back to the full ssllabs test, I currently get a solid A. Yeah!

In the spirit of let’s learn something here beyond what the immediate problem requires, I learned then that indeed the openssl libraries were dynamically linked to my apache version. Moreover, I learned that dynamic linking, despite the name, still has a static aspect. The shared object library must be read in at process creation time and perhaps only occasionally re-read afterwards. But it is not read with every single invocation, which I suppose makes sense from a performance point-of-view.
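
One way to check for this condition before deciding whether a restart is needed: on Linux, a process that still has an old copy of a replaced library mapped shows that file as “(deleted)” in /proc/<pid>/maps. The Python sketch below parses text in that format. It is an illustration under that assumption; the function name and the sample lines (addresses, inode numbers, paths) are invented.

```python
def stale_libraries(maps_text):
    """Return shared libraries that are still mapped but deleted on disk."""
    stale = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue   # anonymous mapping, no pathname
        path = fields[5].strip()
        if path.endswith(" (deleted)") and ".so" in path:
            stale.add(path[: -len(" (deleted)")])
    return stale


# Invented sample in /proc/<pid>/maps format
SAMPLE = """\
7f3a1c000000-7f3a1c062000 r-xp 00000000 fd:00 131090 /usr/lib64/libssl.so.1.0.1e (deleted)
7f3a1c400000-7f3a1c5b5000 r-xp 00000000 fd:00 131075 /usr/lib64/libcrypto.so.1.0.1e (deleted)
7f3a1d000000-7f3a1d18a000 r-xp 00000000 fd:00 131001 /lib64/libc-2.12.so
7f3a1e000000-7f3a1e021000 rw-p 00000000 00:00 0
"""

print(sorted(stale_libraries(SAMPLE)))
# ['/usr/lib64/libcrypto.so.1.0.1e', '/usr/lib64/libssl.so.1.0.1e']
```

Feeding it the contents of /proc/<pid>/maps for a running httpd process (read as root) would list any libraries that process loaded before the update; an empty result after the restart suggests the new libraries are in use.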

2016 apache 2.4 SSL config section
For the record…

...
        SSLProtocol all -SSLv2 -SSLv3
        # it used to be this simple
        #SSLCipherSuite ALL:!aNULL:!eNULL:!SSLv2:!LOW:!EXP:!RC4:!MD5:@STRENGTH
# Now it isn't - DrJ 6/2/15. Based on SSL Labs https://weakdh.org/sysadmin.html - DrJ 6/2/15
        SSLCipherSuite          ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA
        SSLHonorCipherOrder     on
...

How to see what ciphers your browser supports
Your best bet is the SSLLABS.com web site. Go to Test my Browser.

University of Hannover offers this site. Just go to this page. But lately I noticed that it does not list ciphers using CBC whereas the SSLlabs site does. So SSLlabs provides a more accurate answer.

2017 update for PCI compliance
Of course this article is ancient and I hesitate to further complicate it, but I also don’t want to tear it down. Anyway, for PCI compliance you’ll soon need to drop 3DES ciphers (3DES is pronounced “triple-DES” if you ever need to read it aloud). I have this implemented on F5 BigIP devices. I have set the ciphers to:

DEFAULT:!DHE:!3DES:+RSA

and this did the trick. Here’s how to see what effect that has from the BigIP command line:

$ tmm --clientciphers 'DEFAULT:!DHE:!3DES:+RSA'

       ID  SUITE                            BITS PROT    METHOD  CIPHER  MAC     KEYX
 0: 49200  ECDHE-RSA-AES256-GCM-SHA384      256  TLS1.2  Native  AES-GCM  SHA384  ECDHE_RSA
 1: 49199  ECDHE-RSA-AES128-GCM-SHA256      128  TLS1.2  Native  AES-GCM  SHA256  ECDHE_RSA
 2: 49192  ECDHE-RSA-AES256-SHA384          256  TLS1.2  Native  AES     SHA384  ECDHE_RSA
 3: 49172  ECDHE-RSA-AES256-CBC-SHA         256  TLS1    Native  AES     SHA     ECDHE_RSA
 4: 49172  ECDHE-RSA-AES256-CBC-SHA         256  TLS1.1  Native  AES     SHA     ECDHE_RSA
 5: 49172  ECDHE-RSA-AES256-CBC-SHA         256  TLS1.2  Native  AES     SHA     ECDHE_RSA
 6: 49191  ECDHE-RSA-AES128-SHA256          128  TLS1.2  Native  AES     SHA256  ECDHE_RSA
 7: 49171  ECDHE-RSA-AES128-CBC-SHA         128  TLS1    Native  AES     SHA     ECDHE_RSA
 8: 49171  ECDHE-RSA-AES128-CBC-SHA         128  TLS1.1  Native  AES     SHA     ECDHE_RSA
 9: 49171  ECDHE-RSA-AES128-CBC-SHA         128  TLS1.2  Native  AES     SHA     ECDHE_RSA
10:   157  AES256-GCM-SHA384                256  TLS1.2  Native  AES-GCM  SHA384  RSA
11:   156  AES128-GCM-SHA256                128  TLS1.2  Native  AES-GCM  SHA256  RSA
12:    61  AES256-SHA256                    256  TLS1.2  Native  AES     SHA256  RSA
13:    53  AES256-SHA                       256  TLS1    Native  AES     SHA     RSA
14:    53  AES256-SHA                       256  TLS1.1  Native  AES     SHA     RSA
15:    53  AES256-SHA                       256  TLS1.2  Native  AES     SHA     RSA
16:    53  AES256-SHA                       256  DTLS1   Native  AES     SHA     RSA
17:    60  AES128-SHA256                    128  TLS1.2  Native  AES     SHA256  RSA
18:    47  AES128-SHA                       128  TLS1    Native  AES     SHA     RSA
19:    47  AES128-SHA                       128  TLS1.1  Native  AES     SHA     RSA
20:    47  AES128-SHA                       128  TLS1.2  Native  AES     SHA     RSA
21:    47  AES128-SHA                       128  DTLS1   Native  AES     SHA     RSA

2018 update and comment about PCI compliance
I tried to give the owners of e1st.smapply.org a hard time for supporting such a limited set of ciphersuites – essentially only the latest thing (which you can see yourself by running it through ssllabs.com): TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. If I run this through SSL interception on a Symantec proxy with an older image, 6.5.10.4 from June, 2017, that ciphersuite isn’t present! I had to upgrade to 6.5.10.7 from October 2017, then it was fine. But getting back to the rationale, they told me they have future-proofed their site for the new requirements of PCI and they would not budge and support other ciphersuites (forcing me to upgrade).

Another site in that same situation is https://shop-us.bestunion.com/. I don’t know if it’s a misconception on the part of the site administrators or if they’re onto something. I’ll know more when I update my own PCI site to meet the latest requirements.

2020 Update

In this year they are trying to phase out TLS v 1.0 and v 1.1 in favor of TLS v 1.2 or v 1.3. Now my web site’s grade is capped at a B because it still supports those older protocols.

Additional resources and references
As you see from the above openssl is a very useful tool, and there’s lots more you can do with it. Some of my favorite openssl commands are documented in this blog post.

A great site for testing the strength of any web site’s SSL setup, vulnerability to POODLE, etc is this Qualys SSL Labs testing site. No obnoxious ads either. A much more basic one is https://www.websiteplanet.com/webtools/ssl-checker/ SSLlabs is much more complete, but it only works on web sites running on the default port 443. websiteplanet is more about whether your certificate is installed properly and such.

Need to know what ciphers your browser supports? Qualys SSL Labs again to the rescue: https://www.ssllabs.com/ssltest/viewMyClient.html shows you all your browser’s supported ciphers. However, the results may not be reliable if you are using a proxy.

An excellent article explaining in technical terms what the problem with SSLv3 actually is is posted by, who else, Paul Ducklin the Sophos NakedSecurity blogger.

This RFC discusses why TLS v 1.2 or higher is preferred over TLS 1.0 or TLS 1.1: https://tools.ietf.org/html/rfc7525

The Digicert certificate inspector includes a vulnerability assessment as well. It seems useful.

Want a readily understandable explanation of what CBC (Cipher Block Chaining) means? It isn’t too hard to understand. This is an excellent article from Sophos’ Paul Ducklin. It also explains the Sweet32 attack.

An equally detailed explanation of the openssl padding oracle vulnerability is here: https://blog.cloudflare.com/yet-another-padding-oracle-in-openssl-cbc-ciphersuites/

A fast dedicated test for CVE-2016-2107, the padding oracle vulnerability: https://filippo.io/CVE-2016-2107/. SSLlabs test is more thorough – it checks for everything – but much slower.

Compiling apache version 2.4 is described here: https://drjohnstechtalk.com/blog/2015/07/compiling-apache24-on-centos/ and more recently, here: https://drjohnstechtalk.com/blog/2020/04/trying-to-upgrade-wordpress-brings-a-thicket-of-problems/

If you want to see how your browser deals with different certificate issues (expired, bad chained CERT) as well as different ciphers, this has a test case for all of that. This is very useful for testing SSL Interception product behavior. https://badssl.com/

Aimed at F5 admins, but a really good review for anyone about cipher suites, SSL vs TLS and all that is this F5 document. I recommend it for anyone getting started.

This site will never run SSL! This can be useful when you are trying to log in to a hotel’s guest WiFi, which may not be capable of intercepting SSL traffic to force you to their sign-on page: http://neverssl.com/.

Want to test if a web site requires client certificates, e.g., for authentication? This post has some suggestions.

Conclusion
We now have some idea of what those kooky cipher strings actually mean and our eyes don’t glaze over when we encounter them! Plus, we have made our Nortel gear more secure by deploying a cipher string which disallows anonymous authentication.

It seems SSL exploits have been discovered at a reliable pace since this article was first published. It’s best to check your servers running SSL at least twice a year, or better, every quarter, using the SSLlabs tool.

Categories
Admin IT Operational Excellence Linux

Splitting a Text File Into Two Lines with Awk

Intro
How do you split a text file into two lines output per one original input line? Of course there are zillions of ways, with shell, xargs, Perl, your favorite tool, etc. But I decided to revisit that old standard awk to see if it might not just be the best (most compact and intelligible) way to do it!

The Challenge
I was provided a spreadsheet concerning printers in a new building, which I was to use to create access table entries for sendmail, i.e., so that they would be permitted to relay mail (these days it seems all printers are also scanners).

I wanted to have a comment line with the native printer name, with format

# Printer_Name

Then the appropriate access table entry, which has format

IP_ADDRESS   RELAY

As an additional wrinkle, the spreadsheet had columns separated by a variable amount of whitespace! It was very similar to the input below, which I had in a file called tmp:

PA01-USCVI-B52_160-P137C              Bldg 52 Plant 1st 160           10.12.210.161
PA02-Y-B53_160-D220                 Blag 53 Plant 1st 160               10.13.209.162
PA03-UIT-B54_COPY1-D645C         Bldg 54 Plant 1st Copy Rm   10.208.211.163
PA04-RUITY-B55-P235                 Bldg 55 Plant Basement Off         10.14.205.169
PA05-THY-675            John Tollesin    Bldg 53 Plant 2nd 220          10.13.204.156 
 

Fortunately I was interested in the first and last fields, which kept things simple. Here’s what I came up with:
 

awk '{print "# "$1"\n"$NF"\tRELAY"}' tmp

Not bad, eh? In addition to being relatively few characters, it makes sense to me, so I will remember this trick for the next time, which is a timesaver.

I have to get myself to a Unix or Cygwin session to show the output, but it is as I described. I guess the biggest trick is that awk allowed me to conveniently write out two lines in one statement by creating an ASCII newline character with the “\n” escape sequence. It’s probably better known that $1 stands for the first field and $NF stands for the last field of a line (NF being the number of fields).
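
As a rough sketch of what the one-liner produces, the snippet below feeds two of the sample rows through the same awk program. The function name print_access_entries is only a label made up for this illustration, and a heredoc stands in for the tmp file.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner above: reads spreadsheet rows on
# stdin and writes a comment line plus an access table entry for each row.
print_access_entries() {
    awk '{print "# "$1"\n"$NF"\tRELAY"}'
}

# Two rows taken from the sample input; the heredoc stands in for the tmp file.
print_access_entries <<'EOF'
PA01-USCVI-B52_160-P137C              Bldg 52 Plant 1st 160           10.12.210.161
PA05-THY-675            John Tollesin    Bldg 53 Plant 2nd 220          10.13.204.156
EOF
```

Each input row becomes two output lines: "# " followed by the printer name, then the IP address, a tab and the word RELAY. The extra "John Tollesin" column in the second row does no harm because only the first and last fields are used.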

Conclusion
Sexier tools have come along, but don’t give up on our old friend awk – basic knowledge of what it does can be a real timesaver.

Categories
Admin Internet Mail

Obscure Tips for Sendmail Admins

Intro
Sendmail is an amazing program. The O’Reilly Sendmail book is its equal, coming in at well over 1000 pages. I constantly marvel at how it was possible to pack so much knowledge into one book written by one person. Having run sendmail for over 10 years, I’ve built up a few inside tips that can be extremely hard to find out by yourself, even with the book’s help. I just learned one today, in fact, so I thought I’d put it plus some others in one place where their chances of being useful are slightly greater.

Tip 1: Multiple IPs in a Mailertable Entry, No MX Record Required
Today I learned that you can specify multiple destinations in a mailertable entry even when you’re using IP addresses, as in this example:

drjohnstechblog.com       smtp:[50.17.188.196]:[10.10.10.11]

I tested it by putting 50.17.188.196 behind a firewall where it was unreachable. Sure enough, the smtp mail delivery agent of sendmail tried 10.10.10.11 next. You can continue to extend this with additional IPs.

Why is this important? If someone has provided you a private IP to forward mail to, say because of a company-to-company VPN, you cannot rely on the usual DNS lookups to do the routing. And a big outfit may have two MTAs reachable in this way. Now you’ve got redundancy built in to your delivery methods, just as you have for organizations with multiple MX records. I paged through the book this morning and did not find it. Maybe it’s there. But it’s in an obscure spot if it is.

Tip 2: Error Message Containing Punctuation
I also don’t think it’s obvious how to include multiple punctuation marks in a custom error message, even after reading the book. Here’s an example for your access table:

To:hotimail.com   ERROR:"550 You sent an email to hotimail.com.  You probably meant hotmail.com?"

So it’s the quotes that allow you to include the several punctuation marks. The 550 at the beginning will be seen as the error number.

Tip 3: Smarttable for Sender-Based Routing Decisions
Have you ever wanted to make routing decisions based on sender address rather than recipient address? Well, you can! The key is to use smarttable. In my MC file I have:

dnl Define an enhancement, smarttable, from Andrzej Filip
dnl now at http://jmaimon.com/sendmail/anfi.homeunix.net/sendmail/smarttab.html
FEATURE(`smarttable',`hash -o /etc/mail/smarttable')dnl

It’s sufficiently well documented at that page. You need his smarttable.m4 file. So this is not for beginners, but it’s not that hard, either. Although it looks like smarttable hasn’t been updated since 2002, I want to mention that it still works with the latest versions of sendmail. You can route based on the sender domain, or an individual sender address. I use it to send some messages to an encryption gateway. My smarttable entries tend to look like this:

drjohn@drjohnstechtalk.com          relay:[192.168.12.34]

What’s First: Routing Based on Sender or Recipient??
What if your recipient’s domain is in the mailertable and your sender’s address is in the smarttable? What takes precedence in that case? The mailertable entry does. I do not know a way to change that. I actually did experience that conflict and found one way around it.

In my case I had some mailertable entries like this one:

drjohnstechblog.com            relay:drjohnstechtalk.com

with my smarttable entry as above. So I get into this conflict when drjohn@drjohnstechtalk.com wants to send email to someone@drjohnstechblog.com. What I did is run a private BIND DNS server and remove the mailertable entry. My private DNS server is mostly a cache-only server with the usual Internet root servers. But since the public Internet value for the MX record for drjohnstechblog.com is not what I wanted for mail delivery purposes, I created a zone for drjohnstechblog.com on my private DNS server and created the MX record

drjohnstechblog.com            IN   MX   0 drjohnstechtalk.com.

thus overriding the public MX value for drjohnstechblog.com. Then, of course, I have my server where I am running sendmail set to use my private DNS server as nameserver in /etc/resolv.conf, i.e.,

nameserver 127.0.0.1

since I ran my private DNS server on the same box. Without the mailertable entry sendmail uses DNS to determine how to deliver email unless of course the sender matches a smarttable entry! If my server relies on resolving other resource records within drjohnstechblog.com for other purposes then I have to redefine them, too.

This trick works for individual domains. What if you feel the need for an “everything else” entry in your mailertable, i.e.,

.                     relay:relayhost.drjohnstechtalk.com

Well, you’re stuck! I don’t have a solution for you. My DNS trick above could be extended to work for mail with some wildcard entries, but it will break so many other things that you don’t want to go there.

Tip 4: How to send the same email to two (or more) different servers
Someone claimed to need this unusual feature. See the discussion in the comment section about how I believe this is possible to do and an outline of how I would do it.
The blog posting I reference about running sendmail in queue-only mode is here.

Conclusion
Hopefully these sendmail tips will make your life as a sendmail admin toiling away in obscurity (not that I know anyone like that : ) ), just a little easier.

Resources
The sendmail book is the one by Bryan Costales. At Amazon: http://www.amazon.com/sendmail-4th-Bryan-Costales/dp/0596510292/ref=sr_1_1?s=books&ie=UTF8&qid=1316630255&sr=1-1.

My most recent post on how to tame the confounding sendmail log is here.

Using smarttable with a catch-all mailertable entry, plus virtusertable and more, is described in my latest sendmail post.

Categories
Admin DNS IT Operational Excellence

The IT Detective Agency: How We Neutralized Nasty DNS Clobbering Before it Could Bite Us

This gets a little involved. But if you’re the IT expert called on to fix something, you better be able to roll up your sleeves and figure it out!

In this article, I described how some, but not all, ISPs change the results of DNS queries in violation of Internet standards.

A Proxy PAC for All
This work was done for an enterprise. They want everyone to use a proxy PAC file whose location was to be (obfuscating the domain name just a little here) http://webproxy.intranet.drjohnstechtalk.com/proxy.pac. Centralized large enterprises like this sort of thing because the proxy settings are controlled in the one file, proxy.pac, by the central IT department.

So two IT guys try this PAC file setting on their work PC at their home networks. The guy with Comcast as his ISP reports that he can surf the Internet just fine at home. I, with Centurylink, am not so successful. It takes many minutes before an eventual timeout seems to occur and I cannot surf the Internet as long as I have that PAC file configured. But I can always uncheck it and life is good.

Now along comes a new requirement. This organization is going to roll out VPN without split tunneling, and the initial authentication to that VPN is a web page on the VPN switch. Now we have a real problem on our hands.

With my ISP, I can shut off the PAC file, get to the log-on page, establish VPN, but at that point if I wanted to get back out to the Internet (which is required for some job functions) I’d have to re-establish the PAC file setting. Furthermore it is desirable to lock down the proxy settings so that users can’t change them in any case. That makes it sound impossible for Centurylink customers, right?

Wrong. By the way the Comcast guy had this whole scenario working fine.

The Gory Details
This enterprise organization happened to have chosen a legitimately owned but unused internal namespace for the PAC file location, analogous to my webproxy.intranet.drjohnstechtalk.com in my example. I reasoned as follows. Internet Explorer (“IE”) must quickly learn in the Comcast case that the domain name of the PAC file (webproxy.intranet.drjohnstechtalk.com) resolves with an NXDOMAIN and so it must fall back to making DIRECT connections to the Internet. For the unfortunate soul with CenturyLink (me), the domain name is clobbered! It does resolve, and to an active web site. That web site must produce an HTTP 404 not found. At least you’d think so. Today it seems to produce a simplified PAC file, which I am totally astonished by. And I wonder if this is more recent behaviour present in an attempt to ameliorate this situation. In any case, I reasoned that if they were clobbering a non-existent DNS record, we could actually define this domain name, but instead of going through the trouble of setting up a web server with the PAC file, just define the domain name as the loopback interface, 127.0.0.1. There’s no web server to connect to, so I hoped the browser would quickly detect this as a bad PAC URL, go on its way to make DIRECT connections to the VPN authentication web site, and then once VPN were established, use the PAC file again actively to permit the user to surf the Internet. And, furthermore, that this should work for both kinds of users: ones with DNS-clobbering ISPs and ones without.
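
The three cases in that reasoning can be laid out as a small sketch. The helper below is purely illustrative: classify_pac_answer is a made-up name, it takes the address a resolver returned for the PAC host (an empty string if the lookup failed), and it only prints what the reasoning above expects to happen. It does not query DNS or talk to a browser.

```shell
#!/bin/sh
# Hypothetical helper: map the resolver's answer for the PAC host name to the
# behaviour the reasoning above predicts. The address is passed in as $1;
# an empty string stands for an NXDOMAIN (no answer).
classify_pac_answer() {
    case "$1" in
        "")    echo "NXDOMAIN: no PAC host, expect a quick fallback to DIRECT" ;;
        127.*) echo "loopback: no web server there, expect a quick fallback to DIRECT" ;;
        *)     echo "other address: real PAC server (on VPN) or a clobbered answer (off VPN)" ;;
    esac
}

classify_pac_answer ""               # ISP that honors NXDOMAIN
classify_pac_answer "127.0.0.1"      # the proposed public DNS entry
classify_pac_answer "203.0.113.10"   # documentation-range address standing in for an ISP landing page
```

On a machine with dig installed, the argument could come from something like dig +short on the PAC host name, but that wiring is an assumption and is left out here so the sketch stays self-contained.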

That’s a lot of assumptions in the previous paragraph! But I built the case for it – it’s all based on reasonable extrapolation from observed behaviour. More testing needs to be done. What we have seen so far is that this DNS entry does no harm to the Comcast user. Direct Internet browsing works, VPN log-in works, Internet browsing post-login works. For the CenturyLink user the presence of this DNS entry permitted the browser of the work PC to surf the Internet very readily, which is already progress. VPN was not tested but I see no reason why it wouldn’t work.

More tests need to be done but it appears to be working out as per my educated guess.

April 2012 Update
Our fix seemed to collapse like a house of cards all-of-a-sudden many months later. Read how instead of panicking, we re-fixed it using our best understanding of the problems and mechanisms involved. The IT Detective Agency: Browsing Stopped Working on Internet-Connected Enterprise Laptops

Conclusion
We found a significant issue with DNS clobbering as practiced by some ISPs in an enterprise-class application: VPN. We found a work-around after taking an educated guess as to what would work – defining webproxy… to resolve to 127.0.0.1. We could have also changed the domain name of the PAC file – to one that wouldn’t be clobbered – but that was set by another group and so that option was not available to us. Also, we don’t yet know how extensive DNS clobbering is at other ISPs. Perhaps some clobber every domain name which returns an NXDOMAIN flag. That’s what Google’s DNS FAQ seems to imply at any rate. A more sensible approach may have been to migrate to use the auto-detect proxy settings, but that’s a big change for an enterprise and they weren’t ready to do that. A final concern is what if the PC is running a local web server because some application requires it?? That might affect our results.

Case: just about solved!

References
A related case of Verizon clobbering TCP reset packets is described here.

Categories
Admin Internet Mail IT Operational Excellence

The IT Detective Agency: The Case of Slow Sendmail Performance Finally Cracked

I’ve been running sendmail for years and years. It’s a very solid MTA, though perhaps not fashionable these days. At one point I even made the leap from running on Sun/Solaris to SLES. I’ve always had a particular problem on a couple of these servers: they do not react gracefully to mail storms. An application running on another server sends out a daily mail blast to 2000 users, all at once. Hey, I’m not running Gmail here, but normal volume is several messages per second nonetheless, and that is handled fairly well.

But this mail blast actually knocks the system offline for a few minutes. The load average rockets up to 160. It’s essentially a self-inflicted denial-of-service attack. In my gut I always felt the situation could be improved, but was too busy to look into it.

When it was time to buy a replacement server, I had to consider and justify what to get. A “screaming server” is a little hard for a hardware vendor to turn into an order! So where are the bottlenecks? I decided to capture output of uptime, which provides load averages, and iostat, an optional package which analyzes I/O usage, at five second intervals throughout the day. Here’s the iostat job:

nohup iostat -t -c  -m -x 5 > /tmp/iostat &

and the uptime was a tiny script I called cpu-loop.sh:

#!/bin/sh
# Print a timestamp and the load averages every five seconds
while /bin/true; do
    sleep 5
    date
    uptime
done

called from the command line as:

nohup ~/cpu-loop.sh > /tmp/cpu &

Strange thing is that though load average shoots through the roof, cpu usage isn’t all that high.

If I have this right, load average shows the number of processes that are either runnable or, on Linux, blocked waiting on disk I/O, which is how it can climb while the CPU sits mostly idle. Sendmail forks a process for each incoming email, so the number of sendmail processes climbs dramatically during a mail storm.
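
One way to put a load average in perspective is to divide it by the number of CPUs. The sketch below does just that; load_per_cpu is a made-up helper name, and the 4-CPU figure is an assumption for illustration only, since the CPU count of the server is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: divide a load average ($1) by the number of CPUs ($2)
# so that values well above 1.0 stand out as more waiting processes than CPUs.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.1f\n", load / cpus }'
}

# The load average of 160 mentioned above, on an assumed 4-CPU server.
load_per_cpu 160 4
```

On a live Linux system the two inputs could be taken from /proc/loadavg and getconf _NPROCESSORS_ONLN, but they are passed in by hand here to keep the example independent of the machine it runs on.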

The fundamental question is: are we thirsting for more CPU or more I/O? Then there are the peripheral concerns like speed of the PCI bus, size of the level two cache and number of CPUs. The standard profiling tools don’t quite give you enough information.

Here’s actual output of three consecutive iostat executions:

Time: 05:11:56 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.92    0.00    5.36   21.74    0.00   66.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.00    0.00    3.00     0.00     0.05    37.33     0.03    8.53   5.33   1.60
sdb               0.00   788.40    0.00  181.40     0.00     3.91    44.12     4.62   25.35   5.46  98.96
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    8.00   1.33   0.32
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.01    5.67   2.33   0.56
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   12.00   6.00   0.48
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.08   10.32   1.05   0.80
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  975.00     0.00     3.81     8.00    20.93   21.39   1.01  98.96
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:01 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.05    0.00    4.34   19.98    0.00   70.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.80    0.00    2.80     0.00     0.05    40.00     0.03   10.57   6.86   1.92
sdb               0.00   730.60    0.00  164.80     0.00     3.64    45.20     3.37   20.56   5.47  90.16
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.03   12.31   2.15   0.56
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    6.33   3.33   0.80
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01    9.00   5.00   0.40
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.10   13.37   1.16   0.88
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  899.60     0.00     3.51     8.00    16.18   18.03   1.00  90.24
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:06 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.91    0.00    1.36   10.83    0.00   85.89

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     6.40    0.00    3.40     0.00     0.04    25.88     0.04   12.94   5.18   1.76
sdb               0.00   303.40    0.00   88.20     0.00     1.59    36.95     1.83   20.30   5.48  48.32
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.04   14.77   2.46   0.64
dm-3              0.00     0.00    0.00    0.60     0.00     0.00     8.00     0.00   12.00   5.33   0.32
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   11.00   5.00   0.40
dm-5              0.00     0.00    0.00    5.80     0.00     0.02     8.00     0.08   12.97   1.66   0.96
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  393.00     0.00     1.54     8.00     6.46   16.03   1.23  48.32
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device sdb has reached crazy high utilization levels – 98% before dropping back down to 48%. An average queue size of 4.62 in the first run means a lot of queued-up requests awaiting I/O. Write requests (merged) per second of 788 seems respectable. All this, while the CPU is 67% idle!
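
With a whole day of iostat output, picking out the saturated devices by eye gets tedious. The following is a minimal sketch of a filter for that, assuming the 12-column device layout shown above; busy_devices is a made-up name and the 90% threshold is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical filter: for every device line whose last column (%util) is
# above the threshold in $1, print device name, avgqu-sz and %util.
# Assumes the 12-column "iostat -x" layout shown above.
busy_devices() {
    awk -v limit="$1" 'NF == 12 && $1 != "Device:" && $NF + 0 > limit + 0 { print $1, $9, $NF }'
}

# A few lines from the first sample above; the heredoc stands in for /tmp/iostat.
busy_devices 90 <<'EOF'
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.00    0.00    3.00     0.00     0.05    37.33     0.03    8.53   5.33   1.60
sdb               0.00   788.40    0.00  181.40     0.00     3.91    44.12     4.62   25.35   5.46  98.96
dm-6              0.00     0.00    0.00  975.00     0.00     3.81     8.00    20.93   21.39   1.01  98.96
EOF
```

For this input only sdb and dm-6 are reported, each with its queue size and utilization. The timestamp and avg-cpu lines in the real file have fewer fields, so the NF == 12 check skips them.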

The conclusion: a solid state drive is in order. We are thirsting for I/O more than for CPU. But solid state drives cost money and have to be justified, which takes time. Can we do something to test our hypothesis and show that faster storage will really alleviate the problem? Yes! SSD is like accessing memory. So let’s build a virtual partition from our memory. tmpfs has made this sinfully easy:

mount -t tmpfs none /mqueue -o size=8192m

We set this to be sendmail’s queue directory. The sendmail mc command looks like this:

define(`QUEUE_DIR',`/mqueue/q*')dnl

which I need to further explain at some point.

Now it’s interesting that this tmpfs filesystem doesn’t even show up in iostat! I guess its usage all counts as cpu usage.

I now have to send my mail blast to the system with this tmpfs setup. I’m expecting to have essentially converted my lack of I/O into better usage of spare CPU, resulting in a higher-performance system.

The Results
The results are in and they are dramatic. Previous results using traditional 15K rotating drive:

- disk device became 98% busy
- cpu idle time only dropped as low as 69%
- load average peaked at 37
- SMTP port shut down for some minutes
- 2030 messages accepted in 187 seconds
- 11 messages/second

and now using tmpfs virtual filesystem:

- the load average rose to 3.1 - a much more tolerable result
- the cpu idle time dropped to 32% during the busiest time
- most importantly, the server stayed open for business - the SMTP port did not shut down for the first time!!
- the 2000 messages were accepted in 34 seconds.  
- that's a record 59 messages/second!
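
The message rates above are simple division, and the arithmetic can be checked with a couple of lines of awk. This is only a sketch; rate is a made-up helper name, and the inputs are the message counts and durations from the two lists above.

```shell
#!/bin/sh
# Hypothetical helper: messages per second, rounded to one decimal place.
rate() {
    awk -v msgs="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", msgs / secs }'
}

rate 2030 187   # rotating disk run
rate 2000 34    # tmpfs run

# Ratio of the two rates, i.e. the speed-up from moving the queue to tmpfs.
awk 'BEGIN { printf "%.1f\n", (2000 / 34) / (2030 / 187) }'
```

This prints 10.9, 58.8 and 5.4, which round to the 11 and 59 messages/second quoted above and a speed-up of a bit more than five times.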

Conclusion
Disk I/O was definitely the bottleneck for sendmail. tmpfs rocks! sendmail becomes five times faster using it, and is better behaved. The drawback of this filesystem type is that it is completely volatile and I stand to lose messages if the power ever goes out!

Case Closed!

Categories
Admin IT Operational Excellence Web Site Technologies

Virtual Server not Working in F5 BigIP

OK. This posting is only directly applicable to the small number of people who run BigIP load balancers. And of that set, only a certain subset will likely ever have this situation. Nevertheless, it’s useful to document it. There are lessons in it for the rest of us: it shows the creative problem-solving process used in IT, or rather the creative process that should be used.

So I had a virtual server associated with a certain pool and it was operating fine for years. Then something changes. We want to associate a new name with this virtual server, but test it first, all while keeping the old name working. Well, this is a secured site, by which I mean it is running https rather than http. There’s nothing intrinsic in the web site itself that ties it to a particular name. If this were a run-of-the-mill non-secure site you would solve this problem with DNS. Set up an alias and you’re good to go. But secured sites are a wee bit trickier. They present a certificate after all. And the certificate has just one name, at least ours does, and that name is the original DNS name. (Guess I can address multi-name certificates, known as Subject Alternative Name CERTs, in a separate post.) What to do? Simple. As any BigIP admin would tell you, you create a new virtual server and associate it with a new IP and a new SSL profile containing the new certificate you just bought, but with the old pool. In DNS assign this new IP to your new DNS name. That’s all pretty straightforward.

Having done all that, I blithely tested with lynx (it’s an old curses-based browser which runs on old Unix systems. The main point is to not test with a complex browser like Internet Explorer, where you are never 100% sure if the problem lies with the browser. If I had it, I would test with curl, but it’s not on that system.). And…it hangs.

Now I’ll admit a lot of stupid things I did (which is typical of any good debugging session of an IT professional – some self-created red herrings accompany any decent sleuthing) and I ratchet up the debugging a notch. Check the web server logs. I see no log of my lynx accesses. Dig a little deeper still. Fire up a trace. Here’s a little time-saver. BigIP does have a tcpdump program, but it is a little stunted. Typically you have multiple interfaces on a BigIP. In this case I felt it pertinent to know if packets were getting to the BigIP from lynx, and then again, if those packets were leaving the BigIP and going to the web server. So the tip is that whereas a “normal” tcpdump might allow you to use the switch -i any to listen on all interfaces, that doesn’t work on BigIP. Use -i 0.0 instead. And of course restrict it somehow so that your own shell session’s packets won’t be picked up by the trace, or else you could be in for a nasty surprise of exponentially increasing traffic (a devastating situation perhaps worthy of its own blog entry!). In this case I added an expression port 443. So I have:

tcpdump -i 0.0 port 443

And, somewhat to my surprise (You should always have a hypothesis, even if it’s just a gut feeling: will this little test work, or not. Why?) not only were packets going from lynx to BigIP and then again to the web server, I could even see returned packets back from the web server to BigIP to lynx. But it was not a lot of packets. A SYN, SYN-ACK and maybe a single data packet and that’s about it. It should have been more chatty.

The more tests you can think of, the better, especially ones that emphasize the marginal differences between the thing that works and the one that doesn’t. One test along those lines: take this same virtual server and associate it with a different pool. I did that, and that test worked!

Next, I tried to access the web server using curl on the BigIP itself. I could, but not at first. First I used the local web server URL http://web_server_ip:443/. It hung my curl command, just like using lynx on the other server. Hmm. I then looked on the web server again. I notice that it has a certificate installed. Ah. So it’s actually running https. So try curl from BigIP again, but this time with the -k switch (insecure, meaning don’t verify the certificate issuer) and a url beginning with https rather than http. Bingo. It comes back with the home page. Now we’re getting somewhere.

Finally I look more closely at the virtual server setup for the old name, the one that works. I see that the server profile is SSL. It basically means that the traffic is encrypted when it hits the BigIP, and the server CERT is associated with the external name. The BigIP decrypts the traffic, then re-encrypts it before sending it along to the web server. The CERT for the second leg is a self-signed CERT and is never seen by users.

I had forgotten to set up my new test virtual server with the server SSL profile, so the second leg of traffic was not being re-encrypted by the BigIP, even though the web server was only willing to engage in SSL communication with the BigIP. Once I updated the server profile, it all worked fine! Of course after getting the expected results from lynx I went to my desktop browser, just like a regular user, and successfully tested it there as well. You want to make sure your final tests are a realistic approximation of what the user will be doing. If that’s not all possible under your own control, bring in a user for testing.

Liked this article? Here’s another of my IT operational excellence articles that has a somewhat wider applicability.