[Server-devel] Ejabberd CPU/RAM Spike -> Crashes

Thu Dec 17 15:32:29 EST 2009

The server had an uptime of about 50 days before this occurred.  There were
no problems and nothing has changed in the 2 or so days since this problem
began.  As I said previously, it seems to have started after reflashing
and re-registering a student's XO, but I believe that to be a coincidence.

> - Are you perhaps using an AP that does its own DHCP? One way to
> check for certain is to connect an XO, and then grep /var/lib/dhcpd/
> (or is it /var/spool/dhcpd/ ?) for the MAC address of the XO....

We are using 5 wireless APs: 4 are Linksys WRT54Gs running DD-WRT and one
is a D-Link modem/AP combo.  DHCP is deactivated on all of them.
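
For the lease check suggested above, a minimal sketch could look like the
following.  The function name is made up for illustration, and the default
directory is only the first of the two paths guessed at in the question, so
pass the real one as the second argument if it differs.

```shell
# Sketch: list lease files under a directory that mention a given MAC.
# find_mac_in_leases is a hypothetical helper name; /var/lib/dhcpd is an
# assumed default taken from the suggestion quoted above.
find_mac_in_leases() {
    mac=$1
    dir=${2:-/var/lib/dhcpd}
    # -r: recurse, -i: ignore case (MACs may be logged in either case),
    # -l: print only the names of matching files
    grep -ril -- "$mac" "$dir" 2>/dev/null
}
```

Called with the MAC of a freshly connected XO, it prints the matching lease
files and returns non-zero if the XS never handed that XO a lease.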

> - Did you also leave XOs running connected to it, or were XOs
> completely disconnected?

I believe all XO's were disconnected.  It is possible some were left
connected while in their charging cabinets, but doubtful.

>Is there anything else that could be odd or non-standard in your
>setup? Are you in a VM? Is eth0 on the XS configured via dhcp with a
>short lease? Is there anything in the network between the XOs and the
>XS?

Nothing non-standard really.  eth0 is fixed.  Although, this server came
pre-installed from the folks involved with the Give One Get One program in
Rwanda.  I'm not sure what was modified from the stock server install.  I am
debating reinstalling the server from scratch.

I haven't been paying as much attention to the server lately as I should.
As it had been running for about 50 days, I only checked in with the school
periodically.  There were problems but mainly in relation to the presence
service and reliably connecting 30 - 100 laptops to the network at one
time.  I attribute this behavior to the Linksys AP's as they only seem to
handle about 20 connections per AP reliably.  There is also a good amount of
wireless interference to contend with; however, the server was working
well.  As it is a bit under-powered, load averages generally stay within the
1.2-1.5 range.

As I write this, the server has an uptime of about 9 hours.  Load averages
have reached 25 across the board.  The dump files have consumed over a
gigabyte of space, filling up the root partition.

>while true; do (echo `date -u `; vmstat; ps_mem.py | grep ejabberd;
>ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;

I tried the script at night under the high load, but it cannot complete
because the ejabberd node has since crashed.  ejabberdctl yields the
following error:

_________________________________________________________________________
RPC failed on the node ejabberd at schoolserver: {'EXIT',
    {badarg,
     [{ets,lookup,[hooks,{ejabberd_ctl_process,global}]},
      {ejabberd_hooks,run_fold,4},
      {ejabberd_ctl,process,1},
      {rpc,'-handle_call/3-fun-0-',5}]}}
__________________________________________________________________

Individually issuing the commands:
# vmstat
Thu Dec 17 20:07:19 UTC 2009
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
25  0 705768  63912 123132 239040   53   92   153   711 1089  539 61 38  0  1  0

# ps_mem.py | grep ejabberd

No output
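
Since the one-liner stops being useful once the node is down, a variant that
keeps logging through failures might look like the sketch below.  It is
untested on the XS.  The function name and the "n/a" markers are made up for
illustration, and it assumes ejabberdctl exits non-zero when the RPC call
fails, which I have not verified.

```shell
# Sketch: append one sample to a log file, tolerating a crashed ejabberd node.
# log_sample and the "n/a" markers are hypothetical, not part of any tool.
log_sample() {
    logfile=${1:-mylog}
    {
        date -u
        # Note the gap instead of aborting if vmstat is unavailable
        vmstat 2>/dev/null || echo "vmstat: n/a"
        # ps_mem.py prints no ejabberd line once the node is gone
        mem=$(ps_mem.py 2>/dev/null | grep ejabberd || true)
        echo "ejabberd_mem=${mem:-n/a}"
        # Count users only when ejabberdctl itself succeeded
        # (assumption: it exits non-zero on an RPC failure)
        if users=$(ejabberdctl connected-users 2>/dev/null); then
            echo "connected_users=$(printf '%s\n' "$users" | grep -c .)"
        else
            echo "connected_users=n/a"
        fi
    } >> "$logfile"
}
```

It can be driven the same way as the original loop, for example
"while true; do log_sample mylog; sleep 60; done", so the log still gets the
timestamp and vmstat lines even while ejabberd is down.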

I've included a screenshot of htop for your viewing pleasure.

http://omploader.org/vMzBvZQ/htop_screen.jpg

I'll give you more relevant info tomorrow.

On Thu, Dec 17, 2009 at 12:16 PM, Martin Langhoff
<martin.langhoff at gmail.com> wrote:

> On Thu, Dec 17, 2009 at 1:12 PM, Martin Langhoff
> <martin.langhoff at gmail.com> wrote:
> > On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly <devcon at gmail.com>
> > wrote:
> >> XS Version: 0.6
> >> 1 GB Physical Ram, 2GB Swap
> >
> > Ok - the RAM is on the low side for an XS but should handle 150 ok.
> >
> >> # ejabberdctl connected-users
> > ...
> > I counted 12 lines in the output of connected-users. That should not
> > cause trouble.
>
> Also - can you get your hands on ps_mem.py, and run it when the
> machine is getting into trouble? I want to correlate the output of
> ps_mem.py for ejabberd vs the number of connected users. Run something
> like this on a console:
>
> while true; do (echo `date -u `; vmstat; ps_mem.py | grep ejabberd;
> ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;
>
> untested, may need tweaking to work properly. If you run it during the
> day and also during the night, will be most interesting.
>
> cheers,
>
>
> m
> --
>  martin.langhoff at gmail.com
>  martin at laptop.org -- School Server Architect
>  - ask interesting questions
>  - don't get distracted with shiny stuff  - working code first
>  - http://wiki.laptop.org/go/User:Martinlanghoff
>