Categories
Admin Network Technologies

DIY Home Power Monitoring Solution – Perfect for Sandy

Intro
My recent experience losing power thanks to Sandy has gotten me thinking. How can I know when my power’s back on? Or whether it’s been shut off again (which, by the way, actually happened to me)? I realized that I already have all the pieces in place and merely need to take advantage of the infrastructure already out there.

The details
I started with:

  • a working smartphone
  • an enterprise monitoring system
  • a SoHo router in my home

And that’s pretty much it!

The smartphone doesn’t really matter, as long as it receives emails. I guess a plain old cellphone would work just as well – the messages can be sent as text messages.

For enterprise monitoring I like HP SiteScope because it’s more economical than hard-core systems. I wrote a little about it in a previous post. Nagios is also commonly used and it’s open source, meaning free. Avoid Zabbix at all costs. Editor’s note: OK. I’ve changed my mind about Zabbix some seven years later. It’s still confusing as heck, but now I’m using it and I have to admit it is powerful. See this write-up.

A good SoHo router is the Juniper SSG5. It extends the enterprise LAN into the home. You can carve up a Juniper router like that and provide a Home network, Work network, Home wireless and Work wireless. It’s great!

The last requirement is the key. The SoHo router at my house is always on, so the enterprise LAN is always available as long as I have power. Get it? I defined a simple PING monitor in SiteScope to ping my SoHo router’s WAN interface, which has a static private IP address on the enterprise LAN. If I can’t ping it, I’ve lost power, and I use my monitoring system to send an email alert to my smartphone. When power comes back on – or, in the case of Sandy’s interminable outage, if it ever comes back on – I send another alert letting me know that as well. If you’re not using SiteScope, make sure you send several PINGs; a PING can be lost here or there for various reasons. As a guideline, SiteScope sends out five PINGs at a regular interval.
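If you don’t have an enterprise monitor like SiteScope handy, the same idea can be sketched in a few lines of shell running on any always-on box outside your home. This is only a rough sketch – the router IP, email address and polling interval are placeholders to substitute for your own:

#!/bin/sh
# crude power monitor sketch: alert when the home router stops/starts answering pings
TARGET=192.168.8.1          # placeholder: your SoHo router's WAN IP on the enterprise LAN
ALERT=you@example.com       # placeholder: an address that reaches your phone
STATE=up
while true; do
  if ping -c 5 -q "$TARGET" >/dev/null 2>&1; then
    NEW=up
  else
    NEW=down                # all five pings lost: assume power is out
  fi
  if [ "$NEW" != "$STATE" ]; then
    echo "$TARGET is $NEW as of $(date)" | mail -s "Power monitor: $TARGET $NEW" "$ALERT"
    STATE=$NEW
  fi
  sleep 60
done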

Alternative
Of course if I had business-class DSL or cable modem service with a dedicated IP I could have just PINGed that, but I don’t. With regular consumer-grade service your IP can and will change from time to time, and using a dynamic DNS service (like DynDNS) to mask that problem is a bit tricky.

Yes, if you call the power company they offer to call you back when your power is restored, but I like my monitor better. It tells me when things go off as well as on.

Conclusion
This is my necessity-is-the-mother-of-invention moment. Thank you, Sandy!

Categories
Admin DNS Internet Mail

The IT Detective Agency: can’t get email from one sender

Intro
For this article to make any sense whatsoever you have to understand that I enforce SPF in my mail system, which I described in SPF – not all it’s cracked up to be.

The details
Well, some domain admins boldly eliminated their SOFTFAIL conditions – but didn’t quite manage to pull it off correctly! Today I ran into this example. A sender from the domain pclnet.net sent me email from IP 64.8.71.112 which I didn’t get – my SPF protection rejected it. The sender got an error:

550 IP Authorization check failed - psmtp

Let’s look at his SPF record with this DNS query:

$ dig txt pclnet.net

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.10.rc1.el6_3.2 <<>> txt pclnet.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42145
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;pclnet.net.                    IN      TXT
 
;; ANSWER SECTION:
pclnet.net.             300     IN      TXT     "v=spf1 mx ip4:24.214.64.230 ip4:24.214.64.231 ip4:64.8.71.110 ipv4:64.8.71.111 ipv4:64.8.71.112 -all"

That IP, 64.8.71.112, is right there at the end. So what’s the deal?

Well, Google/Postini was called in for help. They apparently still have people who are on the ball, because they noticed something funny about this SPF record, namely, that it isn’t correct. Notice that the first few IPs are prefixed with ip4, while the last ones are prefixed with ipv4? Only one of those is valid: ip4 is the correct SPF mechanism name, while ipv4 is not valid syntax at all, so those IPs are simply ignored by programs which evaluate SPF records – hence the rejection!
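Incidentally, a mistake like this is easy to spot from the command line. The only valid IP mechanisms in SPF are ip4: and ip6:, so grepping a record for the bogus ipv4: spelling (using pclnet.net from above as the example) flags the broken entries immediately:

$ dig +short txt pclnet.net | grep -o 'ipv4:[^ "]*'
ipv4:64.8.71.111
ipv4:64.8.71.112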

My recourse in this case was to remove SPF enforcement on an exception basis for this one domain.

Case closed!

Conclusion
It’s now a few months after my original post about SPF. I’m sticking with it and hope to increase its adoption more broadly. It has worked well, and the exceptions, such as today’s, have been few and far between. It’s a good tool in the fight against spam.

Categories
Admin DNS

The IT Detective Agency: is the guy rebooting our Linux server from France?

Intro
Sometimes you get a case that turns out to be silly and puts a smile on your face. This one is hardly worth the effort to write up, but as I want to portray the life of an IT professional in all its glory and mundanity, it’s worth sharing this anecdote.

The details
Yesterday I was migrating some SLES 11 VMs from an old VMware server to a new platform. A shutdown was required, which I did. I noticed these particular systems had been barely used and hadn’t been rebooted in over half a year (and hence hadn’t been patched in that long). I guess my enterprise monitoring system actually works, because not many minutes after the migration I got a call from the sysadmin, Fautt. He noticed the reboot, but he also noticed something else – a strange IP address associated with the reboot.

He says, over the phone, that the IP is 2.6.32.54 plus some other numbers. Of course when you’re listening over the phone it’s hard to both concentrate on the numbers and understand the concept at the same time. So he threw a giant red herring my way and said that that IP is blah-blah-.wanadoo.fr. Wanadoo.fr caught my ear. I recognize that as an ISP in France, with, like most ISPs, a somewhat dodgy reputation (in my opinion) for sending spam. So you see he led me down this path and I took the bait. I said I would look into it. The obvious underlying yet unspoken concern is this: if someone from France is rebooting my servers, we have been completely compromised. So this was potentially very serious. Yet I knew I had been the one to reboot, so I wasn’t actually panicked.

I ran the commands that I knew he must have run to see what he had seen. They look like this:

> last

reboot   system boot  2.6.32.54-0.3-de Thu Oct 11 14:03          (18:49)
drjohns  pts/0        sys1234.drjohns. Thu Oct 11 13:43 - down   (00:06)
root     pts/0        fauttsys.drjohns Wed Oct 10 11:16 - 11:32  (00:16)
...
root     :0           console          Wed Feb 15 15:13 - 15:55  (00:41)
root     :0                            Wed Feb 15 15:13 - 15:13  (00:00)
root     tty7         :0               Wed Feb 15 15:13 - 15:54  (00:41)
reboot   system boot  2.6.32.54-0.3-de Wed Feb 15 15:07         (33+19:43)
fauttdr  pts/1        fauttsys.drjohns Wed Feb 15 14:38 - down   (00:21)

Actually he told me he had just done a last reboot, but this presentation above makes the illusion more dramatic!

So the last entries contain the IP of the system the admin logged in from, or its domain name if the IP could be reverse-resolved in DNS.

But, much like a magician who draws your attention to where he wants it to go, it is all a clever illusion. If you run

> dig -x 2.6.32.54

as Fautt surely must have, you get

; <<>> DiG 9.6-ESV-R5-P1 <<>> -x 2.6.32.54
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7990
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;54.32.6.2.in-addr.arpa.                IN      PTR
 
;; ANSWER SECTION:
54.32.6.2.in-addr.arpa. 42307   IN      PTR     ABordeaux-652-1-113-54.w2-6.abo.wanadoo.fr.
 
;; Query time: 3 msec
;; SERVER: 10.201.145.20#53(10.114.206.104)
;; WHEN: Fri Oct 12 09:02:21 2012
;; MSG SIZE  rcvd: 96

But I figured that if this was a hack, there might be some mention in Google concerning this notorious IP. I searched for 2.6.32.54. The first few matches talked about the Linux kernel.

That was the duh moment! A quick

> uname -a

on the server to confirm:

Linux drjohns28 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64 x86_64 x86_64 GNU/Linux

So clearly the reboot line in the last output was using a different convention than the other entries and was providing the kernel version!

I told Fautt and he was duly embarrassed but thought it was funny (there aren’t many good jokes in a sysadmin’s routine, apparently :)).

Case closed.

Conclusion
A good night’s rest is important. When you’re tired you’re much more prone to superficial thinking that allows you to be drawn into someone else’s completely wrong picture of events.

Categories
Linux Network Technologies SLES TCP/IP

Ethernet Bridging on the cheap. Fail. Then Success with OLTV

Intro
Some experiments just don’t work out. I became curious about a technology that has various names: ethernet bridging, wide-area VLANs, OTV, L2TP, etc. It looked like it could be done on the cheap, but that didn’t pan out for me. But later on we got hold of high-end gear that implements OTV and began to get it to work.

The details
What this gives you is the ability to extend a subnet to a remote location. How cool is that? It can be very useful for various reasons: a disaster recovery center, for instance, which uses the same IP addressing; a strategic decision to move some, but not all, equipment on a particular LAN to another location; or just for the fun of it.

As with anything truly useful, there are open source implementations. I found OpenVPN, but decided against it: it does have a quite helpful page about creating an ethernet bridging setup, but when you install the product it is all about the client/server paradigm, which is really not what I had in mind for my application.

Then I learned about Astaro RED at the Amazon Cloud conference I attended. That’s RED as in Remote Ethernet Device. It sounded pretty good, but it didn’t seem quite what we were after. It must have looked good to Sophos as well, because as I was studying it, Sophos bought them! Astaro RED is more for extending an ethernet to remote branch offices.

More promising for cheapo experimentation, or so I thought at the time, is etherip.

Very long story short, I never got that to work out in my environment, which was SLES VM servers.

What seems to be the most promising solution, and the most expensive, is overlay transport virtualization (OLTV or simply OTV), offered by Cisco in their Nexus switches. I’ll amend this post when I get a chance to see if it worked or not!

December Update
OTV is beginning to work. It’s really cool seeing it for the first time. For instance, I have a server in South Carolina on an OTV subnet, IP 10.94.45.2. Its default gateway is in New Jersey! Its gateway is in the ARP table, as it has to be, but merely to PING the gateway produces this unusual time lag:

> ping 10.94.45.1

PING 10.94.45.1 (10.194.54.33) 56(84) bytes of data.
64 bytes from 10.94.45.1: icmp_seq=1 ttl=255 time=29.0 ms
64 bytes from 10.94.45.1: icmp_seq=2 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=3 ttl=255 time=29.6 ms
64 bytes from 10.94.45.1: icmp_seq=4 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=5 ttl=255 time=29.4 ms

See those response times? Huge. I ping the same gateway from a different LAN but same server room in New Jersey and get this more typical result:

# ping 10.94.45.1

Type escape sequence to abort.
Sending 5, 64-byte ICMP Echos to 10.94.45.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/0/1 ms
Number of duplicate packets received = 0

But we quickly stumbled upon a gotcha. Large packets were killing us. The thing is that it’s one thing to run OTV over dark fiber, which we know another customer is doing without issues; but to run it in an MPLS network is something else.

Before making any adjustment on our servers we found behaviour like the following:
– initial ssh to linux server works OK; but session soon freezes after a directory listing or executing other commands
– pings with the -s parameter set to anything greater than 1430 bytes failed – they didn’t get returned
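A handy way to confirm this sort of MTU problem from a Linux host, by the way, is to ping with the don’t-fragment bit set and step the payload size up until replies stop. A sketch, reusing the gateway from the example above:

> ping -M do -c 3 -s 1430 10.94.45.1     # 1430 bytes of payload + 28 bytes of headers exceeds a 1430-byte path: fails
> ping -M do -c 3 -s 1402 10.94.45.1     # 1402 + 28 = 1430 just fits: replies come back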

So this issue is very closely related to a problem we observed on a regular segment where getvpn had just been implemented. That problem, which manifested itself as occasional IE errors, is described in some detail here.

Currently we don’t see our carrier being able to accommodate larger packets so we began to see what we could alter on our servers. On Checkpoint IPSO you can lower the MTU as follows:

> dbset interface:eth1c0:ipmtu 1430

The change happens immediately. But that’s not a good idea and we eventually abandoned that approach.

On SLES Linux I did it like this:

> ifconfig eth1 mtu 1430

On this platform, too, the change takes place right away.

By the way, we experimented and found that the largest MTU value we could use was 1430. At this point I’m not sure how to make this change permanent, but a little research should show how to do it.
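For what it’s worth, the usual way to make it stick on SLES is an MTU entry in the interface’s ifcfg file – something along these lines, though I haven’t verified it on this particular release:

> echo "MTU='1430'" >> /etc/sysconfig/network/ifcfg-eth1
> ifdown eth1 && ifup eth1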

After changing this setting, our ssh sessions worked great, though now we can’t send pings larger than 1402 bytes.

The latest problem is that on our OTV segment we can ping only one device but not the other.

August 2013 update
Well, we are resourceful people, so yes, we got it running. Once the dust settled OTV worked pretty well, with certain concessions. We had to be able to control the MTU on at least one side of the connection, which, fortunately, we always could. Load balancers, proxy servers, Linux servers – we ended up jiggering all of them to lower their MTU to 1420. For firewall management we ended up lowering the MTU on the centralized management station.

Firewalls needed further voodoo. After pushing policy, MSS clamping needs to be turned back on and acceleration turned off, like this (for Checkpoint firewalls):

$ fw ctl set int fw_clamp_tcp_mss 1
$ fwaccel off

Conclusion
Having preserved IPs during a server move can be a great benefit and OTV permits it. But you’d better have a talented staff to overcome the hurdles that will accompany this advanced technology.

Categories
Admin Linux ntp SLES

The IT Detective Agency: ntp server shows the wrong time after patching

Intro
One of my ntp servers hadn’t been patched in a while, so it was time. Other systems rely on it for time synchronization. The morning after the patching I noticed that the ntp service wasn’t even running. I started it and went about my business. Checking back some minutes later, it had died again. What happened, and how did we get it fixed? Read on to see how we diagnosed and solved this puzzler.

The details
I like to use ntpq -p to query my ntp server – it’s easy to type! So when I started it up the results looked like this:

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.          10 l   59   64   17    0.000    0.000   0.001
 drjegw.drjn.com 192.5.41.209     2 u   56   64   17    0.335  3605497  26.136
 drjegw2.drjn.co 192.5.41.41      2 u   56   64   17   19.241  3605534  39.621
 drjeuro.drjn.eu 128.252.19.1     2 u   60   64   17  105.970  3605532  38.946

That’s some offset, eh? 3.605 x 10^6 msec, or, when you think about it, just over an hour. And yet the local clock had no offset. Strange.

Date
I like to do a crude check of system time by running the date command – quickly – on two different systems. Short on sleep, I noticed eventually, but not right away, that my ntp server’s date was behind by almost exactly an hour. I didn’t notice it at first because I had trained myself to only look at the seconds, which were “only” off by five seconds.

I checked to see if the timezone or localization settings had been changed by the patching – they hadn’t. So I went ahead and advanced the system clock by an hour. Actually the yast GUI of SLES gave me the option to sync against a time server, so I chose my closest one and did that after I had stopped ntpd.

Next problem, please
That got the time in the ballpark. But ntpd still wasn’t behaving. It exhibited a strange behaviour I’d never seen before – its offset kept increasing. I observed this over the course of several minutes:

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    3   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u  129  128  377    0.350  146.846  81.771
+drjegw2.drjn.co 192.5.41.41      2 u    1  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   72  128  377  104.931  161.696  79.561

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    5   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u    2  128  377    1.803  182.380  97.636
+drjegw2.drjn.co 192.5.41.41      2 u    3  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   74  128  377  104.931  161.696  79.561

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l   28   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u   89  128  377    1.803  182.380  97.636
+drjegw2.drjn.co 192.5.41.41      2 u   90  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   32  128  377  104.667  197.864  96.296

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    5   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u    4  128  377    1.813  218.325 113.345
+drjegw2.drjn.co 128.118.25.5     2 u    5  128  377   19.667  219.077 113.157
+drjeuro.drjn.eu 128.252.19.1     2 u   75  128  377  104.667  197.864  96.296

Look at that offset column. See? It keeps going up, at about a rate of 40 msec every two minutes. It ain’t supposed to do that!

So a Unix pal of mine said he had encountered an issue in ntp and had commented out that local clock. I honestly had absolutely no idea what that LOCAL line did, but it had never hurt before.

The local clock comes from these lines in ntp.conf:

server 127.127.1.0              # local clock (LCL)
fudge  127.127.1.0 stratum 10   # LCL is unsynchronized

So I took those out, stopped ntpd with a sudo service ntp stop, synced the time with a sudo sntp -P no -r drjegw.drjn.com, and restarted ntpd. It didn’t work immediately, but it became apparent eventually that it was working.
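Pulled together, the recovery sequence amounted to this (drjegw.drjn.com being my nearest time server):

> sudo service ntp stop
> sudo sntp -P no -r drjegw.drjn.com
> sudo service ntp start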

Meantime I discovered the ntpdc command, which is kind of informative in this situation:

> ntpdc
ntpdc> loopinfo

offset:               0.097373 s
frequency:            -132.558 ppm
poll adjust:          12
watchdog timer:       841 s

This tells me the offset is 97 msec (already too large, in my experience), that for some reason the system clock hadn’t been adjusted in 841 s, and that the clock drift rate was -132 ppm – much, much higher than on any other system.

Then in a few minutes it clicked and got the offset in order:

ntpdc> loopinfo

offset:               0.000000 s
frequency:            -132.558 ppm
poll adjust:          4
watchdog timer:       11 s

So removing the local system clock seemed to be working. But what was the real cause of all this? I discussed it with an admin. Bear in mind that this is a physical server. He said the system clock gets its time from a hardware clock which should be visible in the ILO. We checked it. Sure enough, there it was, reporting in the ILO – still, after we had fixed the problem at the OS level – as one hour behind. There was no way to adjust it manually. The only option was to set up SNTP servers in the ILO, which we did; that forced the ILO to restart.

We logged back in to the ILO and voila, the time was right!

I now realize that the LOCAL clock in the OS was drawing its time from that hardware clock, which must have drifted by an hour since the system was installed.

Before the patch the system was incrementally keeping up with the drift, making the necessary small adjustments periodically. But after the reboot that followed the patching, the discrepancy was too large for it to handle. In /var/log/ntp I even see this line:

14 Sep 07:20:53 ntpd[10259]: time correction of 3605 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.

Conclusion
Now the system is better and we have:

ntpdc> loopinfo

offset:               0.057029 s
frequency:            -5.465 ppm
poll adjust:          4
watchdog timer:       24 s

That’s better, but the offset of 57 msec is still far larger than normal. It’s usable for now, though.

Categories
Admin Network Technologies

The IT Detective Agency: FTPs and other things going slow to one part of one location

Intro
What happens when you have a slowly festering problem – slow but tolerable performance? Is it a problem when no one complains but you know the numbers don’t feel right? Read on to see what happens in this exciting adventure.

The details
I always felt that one of my locations, let’s call it Scattering, was just hard-to-explain slow. Web site accesses as measured by HP SiteScope were slower than at another site, let’s call it Cooper, and the variation seemed larger. You could always chalk it up to the fact that the SiteScope server was in the data center with the good performance, but there seemed to be more to it than that.

The situation was, apparently, tolerable and people grew accustomed to it. Until one day someone said that it just wasn’t right. Ping times from his site, let’s call it Vanderbilt, looked like this from his desktop:

C:\Users>ping dmz-host
 
Pinging dmz-host [10.93.23.12] with 32 bytes of data:
Reply from 10.93.23.12: bytes=32 time=80ms TTL=248
Reply from 10.93.23.12: bytes=32 time=107ms TTL=248
Reply from 10.93.23.12: bytes=32 time=74ms TTL=248
Reply from 10.93.23.12: bytes=32 time=78ms TTL=248

And yet, ping times to another piece of equipment in that same physical server room looked much healthier – on the order of 22 msec with very little jitter. What the…?

So I started looking around at my equipment. My equipment generally had good response times amongst itself, but when traffic crossed over to another group’s equipment the times started to go up. Finally I saw it – a common denominator. My equipment had its gateway supplied by a different group. PINGing the local gateway gave 15 msec response times, with high jitter! I’ve never seen that before. You normally expect response times < 1 msec. Time to turn the problem over to them.

But as a former scientist, I like to have a hypothesis about what might be wrong. I thought maybe a bad cable, because I’ve heard of that. Duplex mismatches on the port are always a strong possibility. Guess what? It wasn’t any of those things. But it was something about the port on their equipment. What do you suppose it was? Hint: the equipment they were using was leftover stuff that needed to be replaced but hadn’t been, because it just kept working.

Mystery Resolved
The port was saturated! It was older equipment with 100 mbit ports; the traffic through it slowly built up over time and sort of crept up on us all. Eventually the demand on it was causing such noticeable problems in response times that someone finally complained!

He shifted the gateway to a 1 gbit port and things got much better. Here are those PINGs now:

C:\Users>ping 10.93.23.12
 
Pinging 10.93.23.12 with 32 bytes of data:
Reply from 10.93.23.12: bytes=32 time=24ms TTL=248
Reply from 10.93.23.12: bytes=32 time=23ms TTL=248
Reply from 10.93.23.12: bytes=32 time=23ms TTL=248
Reply from 10.93.23.12: bytes=32 time=23ms TTL=248
 
Ping statistics for 10.93.23.12:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 23ms, Maximum = 24ms, Average = 23ms

That’s much better!

The HP SiteScope timings are more subtle, but no less important. Users don’t tolerate slow-loading web pages, you know!

I need to make an aside here. Is there no free and widely distributed Linux package to calculate the standard deviation of a set of numbers? I can’t believe it. I had to borrow this guy’s shell script, and it’s pretty lousy because it’s really slow: http://panoskrt.wordpress.com/2009/03/10/shell-script-for-standard-deviation-arithmetic-mean-and-median/. Anyways, doing a before and after analysis on my URL timing, where I fetched a Google page plus images during the daytime from 10 AM to 4 PM, I find this before and after result:

Before

Number of data points in “sca” = 144
Arithmetic mean (average) = .630548611
Standard Deviation = .225365467
Median number (middle) = .664500000

After

Number of data points in “sca2” = 144
Arithmetic mean (average) = .349270833
Standard Deviation = .162399076
Median number (middle) = .263000000

Users were very happy after the upgrade.
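As for the standard deviation aside above, here’s the sort of quick-and-dirty alternative I wish I’d used – a two-pass awk sketch that reads the file of timings (sca, in my case) twice and prints the mean and sample standard deviation:

> awk 'NR==FNR{s+=$1; n++; next} {d=$1-s/n; ss+=d*d} END{printf "mean=%.6f stddev=%.6f n=%d\n", s/n, sqrt(ss/(n-1)), n}' sca sca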

Conclusion
An old gateway port that maxed out its traffic capacity was to blame for performance problems in one part of a data center. This doesn’t happen too often, but it can happen, obviously.

Linux needs a simple, widely available tool to calculate the standard deviation of a set of numbers.

Categories
Admin Internet Mail

SPF – not all it’s cracked up to be

Intro
I recently implemented strict enforcement of Sender Policy Framework records on a mail server. The number of false positives wasn’t worth the added benefit.

The details
When I want to learn about SPF I turn to the Wikipedia article. Many secure mail gateways offer the possibility of filtering out emails based on whether the sending MTA is in the list allowed by that domain’s SPF record in DNS. I began to turn on enforcement domain-by-domain. First for bbb.org, since their domain was spoofed by an annoying spam campaign earlier this year. Then fedex.com (another perennial favorite), aexp.com, amazon.com, ups.com, etc.

I was stricter than the standard. FAIL -> reject; SOFTFAIL -> reject! And it seemed to work well. Last week I went whole hog and turned on SPF enforcement for all domains with a defined SPF record, but with slightly relaxed standards. FAIL -> REJECT; SOFTFAIL -> quarantine. While it’s true that it may have helped prevent some new spam campaigns, complaints started mounting. Stuff was going into quarantine that never used to!

I knew SOFTFAIL must be the issue. Most(?) domains seem to have a SOFTFAIL condition. Here is a typical example for the domain communica-usa.com:

$ dig txt communica-usa.com

; <<>> DiG 9.7.3-P3-RedHat-9.7.3-8.P3.el6_2.2 <<>> txt communica-usa.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15711
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;communica-usa.com.             IN      TXT
 
;; ANSWER SECTION:
communica-usa.com.      600     IN      TXT     "v=spf1 ip4:72.241.62.1/26 ~all"

So the ~all at the end means they have defined a SOFTFAIL that matches “everything else.” And if a domain does have a SOFTFAIL contingency in its SPF record, this is almost always how it is defined. Which, when you think about it, is the same, formally, as not having bothered to define an SPF record at all! Because the proper action to take is NONE.

I reasoned as follows: why bother defining an SPF record unless you know what you’re doing and you feel you actually can identify the MTAs your mail will come from? Except when you’re first starting out and learning the ropes. That’s why I took stronger action than the standard suggested for SOFTFAIL.

A learning experience
Well, it wasn’t immediate, but after a week to ten days the complaints started rolling in. Looking at what had actually been released from quarantine was an eye-opener. It confirmed something I had previously seen, namely that only a minority speak up so the problems brought to my attention amounted to the tip of the iceberg. So the problem of my own creation was quite widespread in actual fact. I investigated a few by hand. Yup, they were sent from IPs not permitted by the narrowly defined part of their SPF record, and they were repeat offenders. Domains such as the aforementioned communica-usa.com, right.com, redphin.com, etc, etc. I haven’t gotten an explanation for a single infringement – finding someone with sufficient knowledge in these small companies to have a discussion with is a real challenge.

Waving the white flag
I had to back down on my quarantining SOFTFAILs in the global SPF setting. Now it’s SOFTFAIL -> PASS. I could have made domain-by-domain exceptions, but when the number of problem domains climbed over a hundred I decided against that reactive approach. I kept the original set of defined domains, ups.com, etc with their more-aggressive settings. These are larger companies, anyways, which either didn’t even have a SOFTFAIL defined, or never needed their SOFTFAIL.

August 2013 update
Well, a year has passed since this experiment – long enough to establish a trend. I was hoping by writing this article to nudge the community in the direction of wider adoption of SPF usage and more importantly, enforcement. Alas, I can now attest that there is no perceptible trend in either direction. How do I know? Because even with our more lax enforcement, we still ran into problems with lots of senders. I.e., they were too, how do I say this politely, confused, and apparently unable to make their strict SPF records actually match their sending hosts! Incredible, but true. If there were widespread or even spotty adoption, even 5 – 10 % of MTAs adopting enforcement of strict SPF records, they would also have refused messages from senders like this and these senders would have been motivated to fix their self-inflicted problems. But, in reality, what I have seen is complete lack of understanding amongst any of the dozen or so companies with this issue and complete inability to resolve it, forcing me to make an exception for their error.

SPF records can be bad for a number of reasons. Some companies publish two TXT records, and that is a source of problems. Here are a couple of examples I had this issue with:

scterm.com

scterm.com.             3600    IN      TXT     "v=spf1 ip4:65.122.37.36 include:spf.protection.outlook.com -all"
scterm.com.             3600    IN      TXT     "MS=ms56827535"

atlantichealth.org

atlantichealth.org.     3600    IN      TXT     "v=spf1 mx ip4:198.140.183.244/32 ip4:198.140.183.245/32 ip4:198.140.184.244/32 ip4:198.140.184.245/32 ip4:207.211.31.0/25 ip4:205.139.110.0/24 ip4:205.139.111.0/24 -all"

So my guesstimate is < 1% adoption of even the most conservative (meaning least likely to reject legitimate emails) SPF enforcement. Too bad about that. Spammers really appreciate it, though.

Conclusion
By being forced to back off the aggressive SPF settings, domains with SOFTFAIL defined as “~all” – which is most of them – are lost to any enforcement. It’s a shame. SPF sounded so promising to me at first. Spammers are really good at controlling botnets, but terrible at controlling corporate MTAs. Still, I do advocate for wider adoption of SPF – without the SOFTFAIL condition, of course!

References
A nice SPF validator is found here. It has no obnoxious ads or spyware.

Categories
Admin Web Site Technologies

unece.org site is busted again – how to access it anyways

Intro
I don’t usually provide tips on productive use of Internet tools, but, since I needed this one myself today, and I wasn’t 100% sure it was going to work, I think this is worth mentioning.

unece.org – completely broken
A not-so-pretty error message gets spit out, including the text:

Configuration Error: 404 page "/webhome/unece.org/www/fileadmin/read.txt" could not be found.

which reveals just a little too much about the hosting environment (i.e., shared) in my opinion.

Who cares? I like this site because it can act as a sort of arbiter in assigning three-letter names to virtually any locale around the world that has more than a few thousand people. Put the UN two-letter country code in front of that and you have a short, unique five-letter reference to any town or city in the world – USNYC for New York, for example. This is my bookmark: http://www.unece.org/cefact/locode/service/location.htm

But today it’s not working. Maybe they’ll fix it tomorrow. Who knows? But I need a location code today. What to do? Well, it’s clearly a simple-minded site with static content – a perfect candidate for the Wayback machine at archive.org.

Long story short, yes, a July, 2011 site version was archived and is available. The archived link becomes:

http://web.archive.org/web/20110605223242/http://www.unece.org/cefact/locode/service/location.htm

And, cool, we’re in business and able to look up location codes!

Categories
DNS IT Operational Excellence Network Technologies

The IT Detective Agency: Roku player can’t connect to Internet

Intro
I bought a used Roku player, just to try it out. Things didn’t go right at first.

The details
Setting it up, I got to the point where it tests its Internet connection. There are three status lights. The first two were OK, but the Internet connection returned an error, I think error #9. It didn’t matter whether I used a wired or wireless connection.

I have an unusual Internet router at home, a Juniper SSG5. All my other devices behind that router were getting to the Internet just fine, however. Long story short, it turns out that there was a slight DNS misconfiguration: the Juniper had itself as secondary DNS server, and there was no primary DNS server. Why this only affected the Roku is still a mystery.

I updated the Juniper config to use 4.2.2.3 as primary DNS server and the Roku connected just fine and it worked wirelessly as well.
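For reference, on ScreenOS that change is a couple of CLI commands, roughly like the following – syntax from memory, so check your release’s documentation:

> set dns host dns1 4.2.2.3
> save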

Case closed!

Conclusion
Well, as a network type of guy, I would have appreciated access to the Roku’s Linux OS so I could have seen for myself what was going on. I’m sure I would have figured out the problem much sooner than I did. The error message was not particularly helpful, and most of the online discussions attribute that problem to causes other than the DNS issue discovered here. Short of OS access, more robust debugging tools, such as PING, would have been mighty handy.

Categories
Admin Security Web Site Technologies

DOS Attack from South Korea

Intro
Last week I witnessed a denial-of-service (DOS) attack against a web server. This is disconcerting, to say the least. I felt that by putting the information out in the open, some leverage might be applied to the party responsible for that server or subnet.

The source of the attack was a single IP, 58.180.70.160.

Some Details
The attack occurred July 31st.

It consisted of over 100,000 object accesses.

I do not see evidence of an actual hack.

It lasted about 10 hours, from 12:30 PM EST to 10:57 PM, with a few preliminary accesses starting at 4 AM before the onslaught at noon.

Clearly some kind of tool was used. It is very aggressive around forms, doing lots of variations in its POST to the same form, over and over. Here’s one small snippet that gives the flavor:

58.180.70.160 - - [31/Jul/2012:22:53:05 -0400] "POST /[omitted]/forgot_password.jsp../../../../../../../../../../etc/passwd/./././././././././././[pattern repeated - total URI is over 4000 characters!]

Then there’s the fishing around for files which I suppose if present may become a vector for attack, like these lines:

58.180.70.160 - - [31/Jul/2012:12:31:47 -0400] "GET /qlc9tjIqjR.cfm HTTP/1.1" 404 292 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:48 -0400] "GET /_vti_pvt/authors.pwd HTTP/1.1" 404 292 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:47 -0400] "GET / HTTP/1.1" 200 9491 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:48 -0400] "GET /crossdomain.xml HTTP/1.1" 404 292 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:48 -0400] "GET /solr/select/?q=test HTTP/1.1" 404 292 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:47 -0400] "GET /inexistent_file_name.inexistent0123450987.cfm HTTP/1.1" 404 292 "-" "<script>alert(12345)</script>" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:47 -0400] "GET http://www.acunetix.wvs HTTP/1.1" 404 292 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:47 -0400] "GET /index HTTP/1.1" 404 292 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:47 -0400] "GET /server-info HTTP/1.1" 404 292 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"
58.180.70.160 - - [31/Jul/2012:12:31:48 -0400] "POST /_vti_bin/shtml.exe?_vti_rpc HTTP/1.1" 405 124 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)" "-" "-"

The line mentioning Acunetix is, I think, the key to understanding this attack. The tool seems to leave a breadcrumb trail to let the attacked site know what was used: the Acunetix WVS Free Edition scanner.

So it’s a semi-legitimate tool, or at least one openly referenced by Google. But in the hands of a malicious or simply ignorant user, such a tool actually does become a DOS weapon. Its aggressive POSTs often have consequences for back-end databases. A medium-duty site is typically not tuned to handle 8,000 POSTs per hour, which is the rate they came in – over two per second. So the site can be tied up servicing these POSTs and not have resources left over to handle legitimate requests.

To get some background on where this originated and who was behind it, I did a whois search on ARIN, the main Internet address space registry. This told me that the address space has been delegated to APNIC, the Asia-Pacific NIC. So I did a whois search on APNIC. That showed me that the range is in turn delegated to the Korea NIC! So it was off to KRNIC and their whois… That in turn gave me the most specific information of all:

More specific assignment information is as follows.
 
[ Network Information ]
IPv4 Address       : 58.180.68.0 - 58.180.71.255 (/22)
Network Name       : SHINBIRO-INFRA
Organization Name  : ONSE Telecom
Organization ID    : ORG2324
Address            : 2F Bundang Center, Onse IDC, 235-230, 
Zip Code           : 448-500
Registration Date  : 20081107
Publishes          : Y
 
[ Technical Contact Information ]
Name               : IP Manager
Organization Name  : ONSE Telecom
Zip Code           : 448-500
Phone              : +82-2-1666-0120
E-Mail             : [email protected]
 
 
- KISA/KRNIC Whois Service -

Obviously it doesn’t get so specific that it tells me who owns the attacking host, but the mentioned subnet, 58.180.68.0/22, is not too big, all things considered, so that’s quite specific compared to where we were when we started our journey. It seems that it’s administered by ONSE Telecom in South Korea.
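For anyone who wants to retrace that chain from the command line, it went roughly like this (the registry whois host names are from memory; KRNIC’s server in particular may have changed over the years):

> whois -h whois.arin.net 58.180.70.160      # ARIN: the block is delegated to APNIC
> whois -h whois.apnic.net 58.180.70.160     # APNIC: further delegated to KRNIC
> whois -h whois.kisa.or.kr 58.180.70.160    # KRNIC: the record shown above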

I tried to contact them by email – the email bounced. I also wrote to [email protected], which also bounced.

KRNIC is run by KISA, which claims to be an Internet security enterprise (fat chance in my book). I filled out their online contact form – it consistently failed to accept my message, no matter how simple I made it.

At this point I decided that this was too many offenses. The owners of that subnet, 58.180.68.0/22, are not playing by the rules of the Internet and they should be cut off from the Internet until they follow the rules.

So I set up rules to discard all packets from subnet 58.180.68.0/22 and I recommend that others do the same on their web sites.
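On a Linux web server, for instance, that could be done with a single iptables rule (run as root; add it to whatever firewall script you already maintain so it survives a reboot):

> iptables -I INPUT -s 58.180.68.0/22 -j DROP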

Amazon Cloud not very sophisticated with regards to security
I initially tried to take my own advice and sought to put the above subnet into a deny security rule on my Amazon cloud service. I work with firewalls, and this sort of thing is commonplace and has been around for well over 15 years. But you can’t do it with the Amazon Cloud security service! I was shocked. They have so many sophisticated services and have clearly thought out this “cloud thing” better than just about anyone, but they didn’t stop to listen to customer requests on this basic security issue – an allow-all-but-these-subnets type of capability.

Conclusion
A DOS attack was observed emanating from a host in South Korea using Acunetix’s free WVS scanner. The owner of the subnet does not post legitimate contact information and should be banned from Internet connectivity for this egregious behaviour perpetrated using its address space.