Intro
I’m writing this post mainly to document what happened to me. I have a CentOS 6.0 image on Amazon AWS, based on a minimal image that I purposely chose so it wouldn’t be loaded down with junky daemons. It ran fine for a year, then one day: nothing!
The details
I think it was up for 400 consecutive days! That’s not necessarily a good idea, but those are the facts. Then over the weekend I could neither ssh in nor reach its web server. Uh oh. You have really limited options at that point with a cloud server. I stopped it from the AWS console and then started it again. No joy. Time for more drastic action, really the last thing I could do short of abandoning the whole image: Terminate. After some breath-holding moments, and after I remembered to re-associate the elastic IP, it came back up. Whew! It now identifies itself as CentOS 6.4, which I don’t fully understand, but it works.
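For the record, here is roughly what that stop/start/terminate dance looks like against the EC2 API, sketched with boto3. I actually did all of this through the web console; the instance ID, AMI ID and elastic IP allocation ID below are placeholders, not my real ones.

# Sketch of the recovery sequence via the EC2 API (boto3).
# All IDs are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"    # hypothetical instance ID
ALLOCATION_ID = "eipalloc-0abc12345"   # hypothetical elastic IP allocation ID

# First attempt: a plain stop/start from the console's equivalent API calls.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

# Last resort: terminate, launch a replacement from the image, and remember
# to re-associate the elastic IP with the new instance.
# ec2.terminate_instances(InstanceIds=[INSTANCE_ID])
# new_id = ec2.run_instances(ImageId="ami-xxxxxxxx", InstanceType="t1.micro",
#                            MinCount=1, MaxCount=1)["Instances"][0]["InstanceId"]
# ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])
# ec2.associate_address(InstanceId=new_id, AllocationId=ALLOCATION_ID)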
I checked the /var/log/messages file for clues as to what happened. There actually were some pretty good clues. Here are the last of many, many similar entries I observed:
...
Nov 29 08:39:23 ip-10-114-206-104 kernel: Out of memory: kill process 29076 (httpd) score 107231 or a child
Nov 29 08:39:23 ip-10-114-206-104 kernel: Killed process 31306 (httpd) vsz:264320kB, anon-rss:30852kB, file-rss:312kB
Nov 29 08:39:23 ip-10-114-206-104 kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Nov 29 08:39:23 ip-10-114-206-104 kernel: httpd cpuset=/ mems_allowed=0
Nov 29 08:39:23 ip-10-114-206-104 kernel: Pid: 31506, comm: httpd Not tainted 2.6.32-131.17.1.el6.x86_64 #1
Nov 29 08:39:23 ip-10-114-206-104 kernel: Call Trace:
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff810c00f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff811102bb>] ? oom_kill_process+0xcb/0x2e0
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81110880>] ? select_bad_process+0xd0/0x110
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81110918>] ? __out_of_memory+0x58/0xc0
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81110b19>] ? out_of_memory+0x199/0x210
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81120262>] ? __alloc_pages_nodemask+0x812/0x8b0
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff8115473a>] ? alloc_pages_current+0xaa/0x110
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff8110d717>] ? __page_cache_alloc+0x87/0x90
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81122bab>] ? __do_page_cache_readahead+0xdb/0x210
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81122d01>] ? ra_submit+0x21/0x30
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff8110e9e3>] ? filemap_fault+0x4c3/0x500
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff810061af>] ? xen_set_pte_at+0xaf/0x170
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81137204>] ? __do_fault+0x54/0x510
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff811377b7>] ? handle_pte_fault+0xf7/0xb50
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81007c4f>] ? xen_restore_fl_direct_end+0x0/0x1
Nov 29 08:39:23 ip-10-114-206-104 kernel: [<ffffffff81006d4b>] ? xen_set_pmd_hyper+0x8b/0xc0
...
So it ran out of memory. I guess there’s a memory leak somewhere, although another post I saw hinted at a flaw in CentOS under paravirtualization. I have no idea.
The interesting thing to me is that the error had been ongoing for days. So had I been watching for it, I could have been proactive and rebooted my server before it locked up.
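Something like the following watchdog, run from cron, would have caught it. This is a minimal sketch of the idea, not something I actually had in place; the log path is the CentOS default, and the pattern and the "last five entries" detail are my own assumptions.

# Minimal sketch: scan /var/log/messages for OOM-killer activity and print a
# warning. Run it from cron, which mails any output to the crontab owner.
import re
import sys

LOGFILE = "/var/log/messages"   # default syslog location on CentOS 6
PATTERN = re.compile(r"Out of memory|invoked oom-killer")

def oom_lines(path):
    """Return log lines that show the kernel OOM killer at work."""
    try:
        with open(path) as f:
            return [line.rstrip("\n") for line in f if PATTERN.search(line)]
    except IOError as err:
        sys.exit("could not read %s: %s" % (path, err))

if __name__ == "__main__":
    hits = oom_lines(LOGFILE)
    if hits:
        print("WARNING: %d OOM-killer entries found; consider a reboot" % len(hits))
        for line in hits[-5:]:   # show the most recent few entries
            print(line)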
Conclusion
My AWS-hosted CentOS VM gave me a scare when it stopped responding. I had to terminate it. An out-of-memory error in the kernel seems to be the proximate cause.