
Adding a swap file in Amazon AWS for CentOS

Intro
I was running a new daemon on my server, factomd, to experiment with digital currency. It’s an old m1.small instance with only 1.7 GB of memory. The first few times I ran it, it would get through 70,000 or so blocks; I would let it run overnight, and then it would run out of memory and crash. My admin skills are a little rusty and dated, but I eventually realized that adding swap space to my server could help.

The details
As it turns out, I’ve been running this server for five years and never bothered to create a swap area. My CentOS version is, I think, 6.0, but it’s hard to tell at this point. Anyway, this command shows the lack of an active swap space:

$ sudo swapon -s

Filename                                Type            Size    Used    Priority

What to do?
Amazon has introduced SSD storage, which is recommended for high I/O demands. That makes sense for swap, which is basically an extension of your memory. It’s also inexpensive in small volumes. I decided to create 2 GB of swap – roughly the same size as the machine’s physical memory – so I bought a gp2 (general purpose) SSD volume of 2 GB. It’s only $0.20/month!
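
I did all of this through the AWS console, but for reference, roughly equivalent AWS CLI commands would look something like the following. The volume ID, instance ID and availability zone here are placeholders, and the device name is just a request that the kernel may rename anyway, as we’ll see shortly:

$ aws ec2 create-volume --volume-type gp2 --size 2 --availability-zone us-east-1a
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdg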

Where did it go?
After attaching it to my instance, I got what is apparently a one-time message saying what device it would appear as – /dev/sdg. I was a little nervous – justifiably, as it turns out – that I would not see it from CentOS. I tried to mount it – no go. Then I did some Internet research and found these two informative commands:

$ sudo lsblk --output NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT,LABEL

NAME    TYPE  SIZE FSTYPE MOUNTPOINT LABEL
xvdj    disk  100G ext4   /mnt/vol
xvde    disk    6G
`-xvde1 part    6G ext4   /
xvde3   disk  896M swap
xvdk    disk    2G

and

$ sudo fdisk -l

Disk /dev/xvdj: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
 
 
Disk /dev/xvde: 6442 MB, 6442450944 bytes
255 heads, 63 sectors/track, 783 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xaae7682d
 
    Device Boot      Start         End      Blocks   Id  System
/dev/xvde1   *           1         783     6289416   83  Linux
 
Disk /dev/xvde3: 939 MB, 939524096 bytes
255 heads, 63 sectors/track, 114 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
 
 
Disk /dev/xvdk: 2147 MB, 2147483648 bytes
22 heads, 16 sectors/track, 11915 cylinders
Units = cylinders of 352 * 512 = 180224 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x83d4c8ed

Turns out I had a swap partition (xvde3) all along but had never activated it! Further, both commands show that the new volume is appearing as xvdk, not xvdg. Go figure. I guess I already had an xvdj volume and the new one took the next available letter. The mount command also showed me which of the above volumes were in use, so I could tell which one had just been added.
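
In principle I could have simply activated that existing 896 MB swap partition with a single command – something like the line below – but at under 1 GB it was smaller than I wanted, so I pressed on with the new volume:

$ sudo swapon /dev/xvde3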

Then I used fdisk to create a partition on it to hold the swap space:

$ sudo fdisk /dev/xvdk

Command (m for help): c
DOS Compatibility flag is not set
 
Command (m for help): u
Changing display/entry units to sectors
 
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-4194303, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-4194303, default 4194303):
Using default value 4194303
 
Command (m for help): w
The partition table has been altered!
 
Calling ioctl() to re-read partition table.
Syncing disks.
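
One thing I skipped: I left the new partition’s type at the default 83 (Linux) rather than 82 (Linux swap). mkswap doesn’t care, but if you want other tools to recognize the partition as swap, one extra fdisk step – sketched here from memory, so the exact prompts may differ slightly – sets the type before writing the table:

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 82
Changed system type of partition 1 to 82 (Linux swap / Solaris)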

$ ls /dev/xvdk*

/dev/xvdk  /dev/xvdk1

$ sudo mkswap /dev/xvdk1

Setting up swapspace version 1, size = 2096124 KiB
no label, UUID=0d782596-03e6-48fd-a0fa-2d0e3174f727

$ sudo swapon /dev/xvdk1

That command activated our new swap partition. To confirm, run swapon -s again:

$ sudo swapon -s

Filename                                Type            Size    Used    Priority
/dev/xvdk1                              partition       2096120 0       -1

Finally, to make this swap partition persist after a reboot, I added this line to /etc/fstab:

/dev/xvdk1      swap            swap    defaults        0 0
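
A quick way to make sure the fstab entry is sane without actually rebooting is to deactivate the swap and then let swapon re-read /etc/fstab – swapon -a activates everything of type swap listed there, which is exactly what happens at boot:

$ sudo swapoff /dev/xvdk1
$ sudo swapon -a
$ sudo swapon -s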

Did it help?
Why yes it did! Now I am using over 900 MB of swap space, so it was needed pretty badly, in fact:

$ sudo swapon -s

Filename                                Type            Size    Used    Priority
/dev/xvdk1                              partition       2096120 945552  -1

And my original motivation – keeping factomd from crashing – was achieved as well. Perhaps it wasn’t so important to use an SSD volume: most of the time the I/O operations per second stayed well below 100. But I did have the satisfaction of seeing it burst to 1000, a figure I never could have hit with a traditional drive.
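
Another lightweight way to keep an eye on swap activity in real time is vmstat – the si and so columns show how many pages are being swapped in and out per second, and a sustained non-zero so value means real memory pressure:

$ vmstat 5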

Appendix
Monitoring I/O
These blockchain verifiers can be killers in terms of resource consumption on little servers like mine. The best tool for analyzing what is going on is iostat:

$ iostat -xz 10

Linux 2.6.32-131.17.1.el6.x86_64 (ip-10-185-21-116)     05/01/17        _x86_64_        (1 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.92    0.00    0.17    0.24    0.85   97.83
 
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdj              0.00     0.45    0.22    0.35     6.90     6.41    23.60     0.01   11.87    8.33   14.05   1.43   0.08
xvde              0.00     0.02    0.02    0.57     0.55     4.70     8.93     0.01   15.32    6.62   15.64   2.84   0.17
xvdep3            0.00     0.00    0.00    0.00     0.00     0.00     8.73     0.00    1.95    1.95    0.00   1.94   0.00
xvdk              0.00     0.01    0.02    0.01     0.19     0.16    11.35     0.00    3.23    0.92   10.75   0.19   0.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.65    0.00    6.44   83.93    1.42    4.56
 
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdj              0.00     1.71  232.42    2.11  3440.68    30.54    14.80     0.43    1.84    1.80    6.95   1.72  40.38
xvde              0.00     0.00   74.59    3.65  3773.45    29.17    48.61     0.31    3.99    3.36   16.91   0.99   7.77
xvdk              5.47   414.93  606.78  230.37  4898.01  5162.39    12.02     1.89    2.26    0.88    5.89   0.18  14.89
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.63    0.00    4.19   89.55    1.23    2.40
 
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdj              0.00     0.00  374.08    0.50  5435.98     4.02    14.52     0.84    2.25    2.25    4.33   1.32  49.32
xvde              0.00     0.00    3.52    0.28   185.03     2.23    49.29     0.01    1.66    1.41    4.80   0.72   0.27
xvdk              1.79    99.72  521.96  108.88  4189.94  1668.83     9.29     0.76    1.21    0.72    3.53   0.14   8.95
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.05    0.00    7.10   72.87    8.46    3.52
 
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdj              0.00     0.00  338.02    8.25  6812.99    66.04    19.87     0.94    2.72    2.71    3.18   1.44  49.84
xvde              0.00     0.00   52.17    1.76  2317.73    14.07    43.24     0.15    2.72    2.43   11.23   0.67   3.63
xvdk              9.20   381.12 1180.58  256.16  9518.27  5098.24    10.17     1.95    1.36    0.78    4.04   0.14  20.65
...

Always mentally discard the first set of numbers when iostat starts up – that first report covers the averages since boot rather than the current interval. But the rest is chock full of information. The CPU time spent waiting for I/O is far too high – 70 to 90% – and a lot of that can be blamed on xvdj: its %util column shows the device busy 40 to 50% of the time. The way I see it, if I/O were instantaneous that wait would drop to 0 and the CPU could be doing other, more productive things. The output also shows my swap device, xvdk, being heavily used at times without becoming too much of a bottleneck (around 20% util).
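
As an aside, iostat also accepts a list of devices if you only care about one or two of them, which makes the output much easier to scan – something along these lines:

$ iostat -xz xvdj xvdk 10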

Then of course there is top, which just confirms that factomd is the resource hog:

$ top

top - 11:45:12 up 1246 days, 14:49,  3 users,  load average: 1.55, 1.73, 1.67
Tasks: 108 total,   1 running, 107 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.6%us,  1.7%sy,  0.0%ni,  4.6%id, 82.3%wa,  0.0%hi,  0.2%si,  0.6%st
Mem:   1695600k total,  1682160k used,    13440k free,     1400k buffers
Swap:  2096120k total,  1003088k used,  1093032k free,    45348k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29702 john      20   0 2956m 1.3g 3984 S 21.4 77.9 490:35.59 factomd
...

Type of CPU
Just for the record, here’s the type of CPU you get with an m1.small instance:

$ cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping        : 7
cpu MHz         : 1799.999
cache size      : 20480 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc up rep_good aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm arat epb xsaveopt pln pts
bogomips        : 3599.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

So that’s a single 2 GHz CPU.
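
If you just want the count of logical CPUs rather than all the details, a quick way is to count the processor stanzas:

$ grep -c ^processor /proc/cpuinfo

1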

Conclusion
We showed how to economically add swap to a CentOS image on Amazon AWS. We showed factomd successfully running on this small instance, and we showed Linux commands that can be used to monitor resource consumption. Knowing what I know now – that factomd is I/O limited – I probably would have also put its files onto their own SSD volume in addition to creating the swap space, which is the project’s recommendation anyway.

References and related
I followed this post for the swap partition creation steps: http://network-howtos.blogspot.com/2015/04/adding-new-swap-partition-to-centos-vm.html


The IT Detective Agency: The Case of Slow Sendmail Performance Finally Cracked

I’ve been running sendmail for years and years. It’s a very solid MTA, though perhaps not fashionable these days. At one point I even made the leap from running on Sun/Solaris to SLES. I’ve always had a particular problem on a couple of these servers: they do not react gracefully to mail storms. An application running on another server sends out a daily mail blast to 2000 users, all at once. Hey, I’m not running Gmail here, but normal volume is several messages per second nonetheless, and that is handled fairly well.

But this mail blast actually knocks the system offline for a few minutes. The load average rockets up to 160. It’s essentially a self-inflicted denial-of-service attack. In my gut I always felt the situation could be improved, but was too busy to look into it.

When it was time to buy a replacement server, I had to consider and justify what to get. A “screaming server” is a little hard for a hardware vendor to turn into an order! So where are the bottlenecks? I decided to capture the output of uptime, which provides load averages, and of iostat, an optional package which analyzes I/O usage, at five-second intervals throughout the day. Here’s the iostat job:

nohup iostat -t -c  -m -x 3 > /tmp/iostat &

and the uptime was a tiny script I called cpu-loop.sh:

#!/bin/sh
# log the date and load average every five seconds
while /bin/true; do
    sleep 5
    date
    uptime
done

called from the command line as:

nohup ~/cpu-loop.sh > /tmp/cpu &

The strange thing is that though the load average shoots through the roof, CPU usage isn’t all that high.

If I have this right, the load average reflects the number of processes in the run queue – runnable, or (on Linux) stuck in uninterruptible sleep waiting on I/O. Sendmail forks a process for each incoming email, so the number of sendmail processes climbs dramatically during a mail storm, and the load average with it.
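
To watch this directly during a storm, a little watcher loop in the same spirit as cpu-loop.sh can log the sendmail process count – this is just a sketch, assuming the daemon shows up as “sendmail” in the process table:

#!/bin/sh
# log a timestamp and the number of sendmail processes every five seconds
while /bin/true; do
    sleep 5
    echo "$(date) : $(pgrep -c sendmail) sendmail processes"
done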

The fundamental question is: are we starved for CPU or for I/O? Then there are the peripheral concerns like the speed of the PCI bus, the size of the level-two cache and the number of CPUs. The standard profiling tools don’t quite give you enough information.

Here’s actual output of three consecutive iostat executions:

Time: 05:11:56 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.92    0.00    5.36   21.74    0.00   66.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.00    0.00    3.00     0.00     0.05    37.33     0.03    8.53   5.33   1.60
sdb               0.00   788.40    0.00  181.40     0.00     3.91    44.12     4.62   25.35   5.46  98.96
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    8.00   1.33   0.32
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.01    5.67   2.33   0.56
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   12.00   6.00   0.48
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.08   10.32   1.05   0.80
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  975.00     0.00     3.81     8.00    20.93   21.39   1.01  98.96
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:01 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.05    0.00    4.34   19.98    0.00   70.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.80    0.00    2.80     0.00     0.05    40.00     0.03   10.57   6.86   1.92
sdb               0.00   730.60    0.00  164.80     0.00     3.64    45.20     3.37   20.56   5.47  90.16
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.03   12.31   2.15   0.56
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    6.33   3.33   0.80
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01    9.00   5.00   0.40
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.10   13.37   1.16   0.88
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  899.60     0.00     3.51     8.00    16.18   18.03   1.00  90.24
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:06 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.91    0.00    1.36   10.83    0.00   85.89

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     6.40    0.00    3.40     0.00     0.04    25.88     0.04   12.94   5.18   1.76
sdb               0.00   303.40    0.00   88.20     0.00     1.59    36.95     1.83   20.30   5.48  48.32
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.04   14.77   2.46   0.64
dm-3              0.00     0.00    0.00    0.60     0.00     0.00     8.00     0.00   12.00   5.33   0.32
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   11.00   5.00   0.40
dm-5              0.00     0.00    0.00    5.80     0.00     0.02     8.00     0.08   12.97   1.66   0.96
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  393.00     0.00     1.54     8.00     6.46   16.03   1.23  48.32
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device sdb reaches crazy high utilization levels – 98% before dropping back down to 48%. An average queue size (avgqu-sz) of 4.62 in the first run means a lot of queued-up requests awaiting I/O. 788 merged write requests per second (wrqm/s) seems respectable. All this while the CPU is 67% idle!
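
By the way, to figure out which logical volume a dm-N device corresponds to, dmsetup lists the device-mapper names along with their major and minor numbers – dm-6 is the one with minor number 6:

$ sudo dmsetup ls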

The conclusion: a solid-state drive is in order. We are starved for I/O, not CPU. But solid-state drives cost money and have to be justified, which takes time. Can we do something in the meantime that tests the hypothesis and really alleviates the problem? Yes! Accessing an SSD is almost like accessing memory. So let’s build a filesystem out of our memory. tmpfs makes this sinfully easy:

mount -t tmpfs none /mqueue -o size=8192m
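
A quick check that the mount took and has the expected size:

$ df -h /mqueue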

We set this to be sendmail’s queue directory. The directive in the sendmail .mc file looks like this:

define(`QUEUE_DIR',`/mqueue/q*')dnl

which I need to further explain at some point.
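
For completeness: after editing the .mc file, sendmail.cf has to be regenerated and sendmail restarted. The exact paths and service commands vary by distribution, so treat this as a rough recipe:

$ cd /etc/mail
$ m4 sendmail.mc > sendmail.cf
$ /etc/init.d/sendmail restart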

Interestingly, this tmpfs filesystem doesn’t even show up in iostat! That makes sense: tmpfs lives in memory (spilling to swap only under pressure), so there is no block device for iostat to report on – the work shows up as CPU and memory usage instead.

Next I have to send my mail blast to the system with this tmpfs setup in place. I’m expecting to have essentially converted the I/O bottleneck into better usage of spare CPU, resulting in a higher-performance system.

The Results
The results are in and they are dramatic. Previous results using a traditional 15K RPM rotating drive:

- disk device became 98% busy
- cpu idle time only dropped as low as 69%
- load average peaked at 37
- SMTP port shut down for some minutes
- 2030 messages accepted in 187 seconds
- 11 messages/second

and now using tmpfs virtual filesystem:

- the load average rose to 3.1 - a much more tolerable result
- the cpu idle time dropped to 32% during the busiest time
- most importantly, the server stayed open for business - the SMTP port did not shut down for the first time!!
- the 2000 messages were accepted in 34 seconds
- that's a record 59 messages/second!

Conclusion
Disk I/O was definitely the bottleneck for sendmail. tmpfs rocks! sendmail becomes five times faster using it, and is better behaved. The drawback of this filesystem type is that it is completely volatile and I stand to lose messages if the power ever goes out!
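
One more practical note: besides the data being volatile, the tmpfs mount itself disappears at reboot, so it has to be recreated before sendmail starts. An /etc/fstab entry along these lines (same size and mount point as above) takes care of that part, though of course not of the potential data loss:

none    /mqueue    tmpfs    size=8192m    0 0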

Case Closed!