Categories
Internet Mail Linux Perl Raspberry Pi

Raspberry Pi phone home

Intro
In this article I described setting up my Raspberry Pi without ever connecting a monitor keyboard and mouse to it and how I got really good performance using an UHS SD card.

This article represents my first real DIY project on my Pi – one of my own design. My faithful subscribers will recall my post after Hurricane Sandy in which I reacted to an intense desire to know when the power was back on by creating a monitor for that situation. It relied on extremely unlikely pieces of infrastructure. I hinted that it may be possible to use the Raspberry Pi to accomplish the same thing.

I’ve given it a lot of thought and assembled all the pieces. Now I have a home power/Internet service monitor based on my Pi!

This still requires a somewhat unlikely but not impossible combination of infrastructure:
– your own hosted server in the cloud
– ability to send emails out from your cloud server
– access log files on your cloud server are rolled over regularly
– your Pi and your cloud server are in the same time zone
– Raspberry Pi which is acting as a server (meaning you are running it 24×7 and not rebooting it and fooling with it too much)
– a smart phone to receive alert emails or TXT messages

I used my old-school knowledge of Perl to whip something up quickly. One of this years I have to bite the bullet and learn Python decently, but it’s hard when you are so comfortable in another language.

The details
Here’s the concept. From your Pi you make regular “phone home” calls to your cloud server. This could use any protocol your server is listening on, but since most cloud servers run web servers, including mine, I phone home using HTTP. Then on your cloud server you look for the phone home messages. If you don’t see one after a certain time, you send an alert to an email account. Then, once service – be it power or Internet connectivity – is restored to your house, your Pi resumes phoning home and your cloud monitor detects this and sends a Good message.

I have tried to write minimalist code that yet will work and even handle common error conditions, so I think it is fairly robust.

Set up your Pi
On your Pi you are “phoning home” to your server. So you need a line something like this in your crontab file:

# This gets a file and leaves a timestamp behind in the access log
* * * * * /usr/bin/curl --connect-timeout 30 http://yourcloudserver.com/raspberrypiPhoneHome?`perl -e 'print time()'` > /dev/null 2>&1

Don’t know what I’m talking about when I say edit your crontab file?

> export EDITOR=vi
> crontab -e

That first line is only required for fans of the vi editor.

That part was easy, right? That will have your server “phone home” every minute, 24×7. But we need an aside to talk about time on the Pi.

Getting the right time on the Raspberry Pi
This monitoring solution assumes Ras Pi and home server are in the same time zone (because we kept it simple). I’ve seen at least a couple of my Raspberry Pi’s where the time zone was messed up so I need to document the fix.

Run the date command
$ date

Sat Apr 29 17:10:13 EDT 2017

Now it shows it is set for EDT so the timezone is correct. At first it showed something like UTC.

Make sure you are running ntp:
$ ntpq ‐p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+time.tritn.com  63.145.169.3     2 u  689 1024  377   78.380    2.301   0.853
-pacific.latt.ne 68.110.9.223     2 u  312 1024  377  116.254   11.565   5.864
+choppa.chieftek 172.16.182.1     3 u  909 1024  377   65.430    4.185   0.686
*nero.grnet.gr   .GPS.            1 u  106 1024  377  162.089  -10.357   0.459

You should get results similar to those above. In particular the jitter numbers should be small, or at least less than 10 (units are msec for the curious).

If you’re missing the ntpq command then do a

$ sudo apt-get install ntp

Set the correct timezone with a

$ sudo dpkg-reconfigure tzdata

and choose Americas, then new York, or whatever is appropriate for your geography. The Internet has a lot of silly advice on this point so I hope this clarifies the point.

Note that you need to do both things. In my experience time on Raspberry Pis tends to drift so you’ll be off by seconds, which is a bad thing. ntp addresses that. And having it in the wrong timezone is just annoying in general as all your logs and file times etc will be off compared to how you expect to see them.

On your server
Here is the Perl script I cooked up. Some modifications are needed for others to use, such as email addresses, access log location and perhaps the name and switches for the mail client.

So without further ado. here is the monitor script:

#!/usr/bin/perl
# send out alerts related to Raspberry Pi phone home
# this is designed to be called periodically from cron
# DrJ - 2/2013
#
# to test good to error transition,
# call with a very small maxDiff, such as 0!
use Getopt::Std;
getopts('m:d'); # maximum allowed time difference
$maxDiff = $opt_m;
$DEBUG = 1 if $opt_d;
unless (defined($maxDiff)) {
  usage();
  exit(1);
}
# use values appropriate for your situation here...
$mailsender = '[email protected]';
$recipient = '[email protected]';
$monitorName = 'Raspberry Pi phone home';
# access line looks like:
# 96.15.212.173 - - [02/Feb/2013:22:00:02 -0500] "GET /raspberrypiPhoneHome?136456789 HTTP/1.1" 200 455 "-" "curl/7.26.0"
$magicString = "raspberrypiPhoneHome";
# modify as needed for your situation...
$accessLog = "/var/log/drjohns/access.log";
#
# pick up timestamp in access file
$piTime = `grep $magicString $accessLog|tail -1|cut -d\? -f2|cut -d' ' -f1`;
$curTime = time();
chomp($time);
$date = `date`;
chomp($date);
# your PID file is somewhere else. It tells us when Apache was started.
# you could comment out these next lines just to get started with the program
$PID = "/var/run/apache2.pid";
($atime,$mtime,$ctime) = (stat($PID))[8,9,10];
$diff = $curTime - $piTime;
print "magicString, accessLog, piTime, curTime, diff: $magicString, $accessLog, $piTime, $curTime, $diff\n" if $DEBUG;
print "accessLog stat. atime, mtime, ctime: $atime,$mtime,$ctime\n" if $DEBUG;
if ($curTime - $ctime < $maxDiff) {
  print "Apache hasn't been running long enough yet to look for something in the log file. Maybe next time\n";
  exit(0);
}
#
$goodFile = "/tmp/piGood";
$errorFile = "/tmp/piError";
#
# Think of it as state machine. There are just a few states and a few transitions to consider
#
if (-e $goodFile) {
  print "state: good\n" if $DEBUG;
  if ($diff < $maxDiff) {
    print "Remain in good state\n" if $DEBUG;
  } else {
# transition to error state
    print "Transition from good to error state at $date, diff is $diff\n";
    sendMail("Good","Error","Last call was $diff seconds ago");
# set state to Error
    system("rm $goodFile; touch $errorFile");
  }
} elsif (-e $errorFile) {
  print "state: error\n" if $DEBUG;
  if ($diff > $maxDiff) {
    print "Remain in error state\n" if $DEBUG;
  } else {
# transition to good state
    print "Transition from error to good state at $date, diff is $diff\n";
    sendMail("Error","Good","Service restored. Last call was $diff seconds ago");
# set state to Good
    system("rm $errorFile; touch $goodFile");
  }
} else {
  print "no state\n" if $DEBUG;
  if ($diff < $maxDiff) {
    system("touch $goodFile");
    sendMail("no state","Good","NA") if $DEBUG;
    print "Transition from no state to Good at $date\n";
# don't send alert
  } else {
    print "Remain in no state\n" if $DEBUG;
  }
}
####################
sub sendMail {
($oldState,$state,$additional) = @_;
print "oldState,state,additional: $oldState,$state,$additional\n" if $DEBUG;
$subject = "$state : $monitorName";
open(MAILX,"|mailx -r \"$mailsender\" -s \"$subject\" $recipient") || die "Cannot run mailx $mailsender $subject!!\n";
print MAILX qq(
$monitorName is now in state: $state
Time: $date
Former state was $oldState
Additional info: $additional
 
- sent from pialert program
);
close(MAILX);
 
}
###############################
sub usage {
  print "usage: $0 -m <maxDiff (seconds)> [-d (debug)]\n";
}

This is called from my server’s crontab. I set it like this:

 Call monitor that sends an alert if my Raspberry Pi fails to phone home - DrJ 2/13
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/drj/pialert.pl -m 300 >> /tmp/pialert.log

My /tmp/pialert.log file looks like this so far:

Transition from no state to Good at Wed Feb  6 12:10:02 EST 2013
Apache hasn't been running long enough yet to look for something in the log file. Maybe next time
Apache hasn't been running long enough yet to look for something in the log file. Maybe next time
Transition from good to error state at Fri Feb  8 10:55:01 EST 2013, diff is 420
Transition from error to good state at Fri Feb  8 11:05:02 EST 2013, diff is 1

The last two lines result from a test I ran – i commented out the crontab entry on my Pi to be absolutely sure it was working.

The error message I got in my email looked like this:

Subject: Error : Raspberry Pi phone home
 
Raspberry Pi phone home is now in state: Error 
Time: Fri Feb  8 10:55:01 EST 2013
Former state was Good
Additional info: Last call was 420 seconds ago
 
- sent from pialert program

Why not use Nagios?
Some will realize that I replicated functions that are provided in Nagios, why not just hang my stuff off that well-established monitoring software? I considered it, but I wanted to stay light. I think my approach, while more demanding of me as a programmer, keeps my server unburdened by running yet another piece of software that has to be understood, debugged, maintained and patched. If you already have it, fine, by all means use it for the alerting part. I’m sure it gives you more options. For an approach to installing nagios that makes it somewhat manageable see the references.

A few words about sending mail
I send mail directly from my cloud server, I have no idea what others do. With Amazon, my elastic IP was initially included in blacklists (RBLs), etc, so I really couldn’t send mail without it being rejected. they have procedures you can follow to remove your IP from those lists, and it really worked. Crucially, it allowed me to send as a TXT message. Just another reason why you can’t really beat Amazon hosting (there was no charge for this feature).

And sending TXT messages
I think most wireless providers have an email gateway that allows you to send a TXT message (SMS) to one of their users via email (SMTP) if you know their cell number. For instance with Verizon the formula is

Verizon

AT&T
cell-number@txt.att.net.

T-Mobile
cell-number@tmomail.net

Conclusion
We have assembled a working power/Internet service monitor as a DIY project for a Raspberry Pi. If you want to use your Pi for a lot of other things I suggest to leave this one for your power monitor and buy another – they’re cheap (and fun)!

I will now know whenever I lose power – could be any minute now thanks to Nemo – and when it is restored, even if I am not home (thanks to my SmartPhone). See in my case my ISP, CenturyLink, is pretty good and rarely drops my service. JCP&L, not so much.

Admittedly, most people, unlike me, do not have their own cloud-hosted server, but maybe it’s time to get one?

References
Open Monitoring Distribution (OMD) makes installing and configuring nagios a lot easier, or so I am told. It is described here.
I’ve gotten my mileage out of the monitor perl script in this post: I’ve recently re-used it with modifications for a similar situation except that the script is being called by HP SiteScope, and, again, a Raspberry Pi is phoning home. Described here.

Categories
Admin Internet Mail IT Operational Excellence

The IT Detective Agency: The Case of Slow Sendmail Performance Finally Cracked

I’ve been running sendmail for years and years. It’s a very solid MTA, though perhaps not fashionable these days. At one point I even made the leap from running on Sun/solaris to SLES. I’ve always had a particular problem on a couple of these servers: they do not react gracefully to mail storms. An application running on another server sends out a daily mail blast to 2000 users, all at once. Hey I’m not running Gmail here, but normal volume is several messages per second nonetheless, and that is handled fairly well.

But this mail blast actually knocks the system offline for a few minutes. The load average rockets up to 160. It’s essentially a self-inflicted denial-of-service attack. In my gut I always felt the situation could be improved, but was too busy to look into it.

When it was time to buy a replacement server, I had to consider and justify what to get. A “screaming server” is a little hard for a hardware vendor to turn into an order! So where are the bottlenecks? I decided to capture output of uptime, which provides load averages, and iostat, an optional package which analyzes I/O usage, at five secon intervals throughout the day. Here’s the iostat job:

nohup iostat -t -c  -m -x 3 > /tmp/iostat &

and the uptime was a tiny script I called cpu-loop.sh:

#!/bin/sh
while /bin/true; do
sleep 5
date
uptime
done

called from the command line as:

nohup ~/cpu-loop.sh > /tmp/cpu &

Strange thing is that though load average shoots the roof, cpu usage isn’t all that high.

If I have this right, load average shows the number of processes scheduled by the scheduler. Sendmail forks a process for each incoming email, so the number of sendmail processes climbs dramatically during a mail storm.

The fundamental issue is are we thirsting for more CPU or more I/O? Then there are the peripheral concerns like speed of pci bus, size of level two cache and number of cpus. The standard profiling tools don’t quite give you enough information.

Here’s actual output of three consecutive iostat executions:

Time: 05:11:56 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.92    0.00    5.36   21.74    0.00   66.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.00    0.00    3.00     0.00     0.05    37.33     0.03    8.53   5.33   1.60
sdb               0.00   788.40    0.00  181.40     0.00     3.91    44.12     4.62   25.35   5.46  98.96
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    8.00   1.33   0.32
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.01    5.67   2.33   0.56
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   12.00   6.00   0.48
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.08   10.32   1.05   0.80
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  975.00     0.00     3.81     8.00    20.93   21.39   1.01  98.96
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:01 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.05    0.00    4.34   19.98    0.00   70.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.80    0.00    2.80     0.00     0.05    40.00     0.03   10.57   6.86   1.92
sdb               0.00   730.60    0.00  164.80     0.00     3.64    45.20     3.37   20.56   5.47  90.16
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.03   12.31   2.15   0.56
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    6.33   3.33   0.80
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01    9.00   5.00   0.40
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.10   13.37   1.16   0.88
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  899.60     0.00     3.51     8.00    16.18   18.03   1.00  90.24
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:06 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.91    0.00    1.36   10.83    0.00   85.89

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     6.40    0.00    3.40     0.00     0.04    25.88     0.04   12.94   5.18   1.76
sdb               0.00   303.40    0.00   88.20     0.00     1.59    36.95     1.83   20.30   5.48  48.32
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.04   14.77   2.46   0.64
dm-3              0.00     0.00    0.00    0.60     0.00     0.00     8.00     0.00   12.00   5.33   0.32
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   11.00   5.00   0.40
dm-5              0.00     0.00    0.00    5.80     0.00     0.02     8.00     0.08   12.97   1.66   0.96
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  393.00     0.00     1.54     8.00     6.46   16.03   1.23  48.32
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device sdb has reached crazy high utilization levels – 98% before dropping back down to 48%. An average queue size of 4.62 in the first run means a lot of queued up processes awaiting I/O. Write requests (merged) per second of 788 seems respectable. All this, while the CPU is 67% idle!

The conclusion: a solid state drive is in order. We are dying thirsting for I/O more than for CPU. But solid state drives cost money and have to be justified which takes time. Can we do something which proves it will bear out our hypothesis and really alleviate the problem? Yes! SSD is like accessing memory. So let’s build a virtual partition from our memory. tmpfs has made this sinfully easy:

mount -t tmpfs none /mqueue -o size=8192m

We set this to be sendmail’s queue directory. The sendmail mc command looks like this:

define(`QUEUE_DIR',`/mqueue/q*')dnl

which I need to further explain at some point.

Now it’s interesting that this tmpfs filesystem doesn’t even show up in iostat! I guess its usage all counts as cpu usage.

I now have to send my mail blast to the system with this tmpfs setup. I’m expecting to have essentially converted my lack of I/O into better usage of spare CPU, resulting in a higher-performance system.

The Results
The results are in and they are dramatic. Previous results using traditional 15K rotating drive:

- disk device became 98% busy
- cpu idle time only dropped as low as 69%
- load average peaked at 37
- SMTP port shut down for some minutes
- 2030 messages accepted in 187 seconds
- 11 messages/second

and now using tmpfs virtual filesystem:

- the load average rose to 3.1 - a much more tolerable result
- the cpu idle time dropped to 32% during the busiest time
- most imporantly, the server stayed open for business - the SMTP port did not shut down for the first time!!
- the 2000 messages were accepted in 34 seconds.  
- that's a record 59 messages/second!

Conclusion
Disk I/O was definitely the bottleneck for sendmail. tmpfs rocks! sendmail becomes five times faster using it, and is better behaved. The drawback of this filesystem type is that it is completely volatile and I stand to lose messages if the power ever goes out!

Case Closed!