[Server-devel] Ejabberd CPU/RAM Spike -> Crashes

Thu Dec 17 19:59:52 EST 2009

On Thu, Dec 17, 2009 at 9:32 PM, Devon Connolly <devcon at gmail.com> wrote:
> The server had an uptime of about 50 days before this occurred.  There were
> no problems and nothing has changed in the 2 or so days since this problem
> began.  Like I had said previously, it seems to have occurred since reflashing
> and re-registering a student's XO, but I believe that to be a coincidence.

Hmmm, maybe something's gone wonky on the mnesia DB.

> We are using 5 wireless AP's.  4 of which are Linksys WRT54G's running
> DD-WRT and one is a D-Link modem/AP combo.  DHCP is deactivated on all of
> the above.

Good.

>> - Did you also leave XOs running connected to it, or were XOs
>> completely disconnected?
>
> I believe all XO's were disconnected.  It is possible some were left
> connected while in their charging cabinets, but doubtful.

Ok. Then ejabberd is getting messed up all on its own...

> Nothing non-standard really.  eth0 is fixed.

good

> Although, this server came
> pre-installed from the folks involved with the Give One Get One program in
> Rwanda.  I'm not sure what was modified from the stock server install.  I am
> debating reinstalling the server from scratch.

Don't reinstall. If possible, let's try to debug this. If you're going
to give up, just

1 - Backup /var/lib/ejabberd -- just tar it up
2 - Use the 'domain_config' script to change the domain -- this will
re-generate the ejabberd mnesia database. What I'd do: change it to
'foo.com' and then back to the right domain.
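
A minimal sketch of step 1, assuming a root shell. The `backup_dir`
helper name and the /root destination are illustrative, not part of the
XS tooling, and stopping ejabberd before the copy is an assumption made
here so the mnesia files are not changing while tar reads them. Step 2
is not scripted because the exact 'domain_config' invocation depends on
the installed server release.

```shell
#!/bin/sh
# backup_dir SRC DEST_DIR
# Hypothetical helper: tar up SRC into DEST_DIR and print the archive path.
backup_dir() {
    src=$1
    dest_dir=$2
    stamp=$(date -u +%Y%m%d-%H%M%S)
    archive="$dest_dir/$(basename "$src")-$stamp.tar.gz"
    # -C keeps the paths inside the archive relative to SRC's parent directory
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Usage (as root; stopping ejabberd first is an assumption, see above):
#   backup_dir /var/lib/ejabberd /root
```

Keeping the timestamp in the file name means a second backup taken
after further debugging does not overwrite the first one.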

> I attribute this behavior to the Linksys AP's as they only seem to
> handle about 20 connections per AP reliably.

yeah. we've seen that plenty.

>  There is also a good amount of
> wireless interference to contend with; however, the server was working
> well.

I assume you have the different APs on different channels, and
generally avoid channel 1 (as that's where XOs engage in 'mesh' by
default).

>>while true; do (echo `date -u `; vmstat; ps_mem.py | grep ejabberd;
>>ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;
>
> Tried the script at night with the high load, and it cannot complete as the
> ejabberd node has since crashed.  ejabberdctl yields the following error:

Can you restart ejabberd and try that script?
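
If the node dies again while the loop is running, a variant that keeps
logging regardless may be more useful. The following is only a sketch
under assumptions: `log_sample` is a hypothetical name, ps_mem.py is
assumed to be on the PATH, and the extra 'erl' and 'beam' patterns are a
guess at how the Erlang VM might be listed if it does not show up as
ejabberd.

```shell
#!/bin/sh
# log_sample LOGFILE
# Hypothetical helper: append one timestamped sample to LOGFILE.
# Every command is allowed to fail, so a crashed ejabberd node
# does not stop the logging.
log_sample() {
    log=$1
    {
        date -u
        vmstat || echo "vmstat failed"
        # The Erlang VM may be listed as erl or beam rather than ejabberd
        ps_mem.py 2>/dev/null | grep -E 'ejabberd|erl|beam' \
            || echo "no ejabberd/erl process in ps_mem.py output"
        # Count connected users, or record that the node did not answer
        if users=$(ejabberdctl connected-users 2>/dev/null); then
            printf '%s\n' "$users" | grep -c . || true
        else
            echo "ejabberdctl failed (node down?)"
        fi
    } >> "$log" 2>&1
}

# Usage: one sample per minute until interrupted.
#   while true; do log_sample mylog; sleep 60; done
```

The last line of each sample is either the user count or the failure
message, so the moment the node stops responding is easy to find in the
log.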

> # ps_mem.py | grep ejabberd
>
> No output

Did you download ps_mem.py, and make it executable? (google the name
if needed) If so, you might want to grep for erl instead.

> I've included a screenshot of htop for your viewing pleasure.
> http://omploader.org/vMzBvZQ/htop_screen.jpg

ejabberd sure looks busy there...

m
-- 
 martin.langhoff at gmail.com
 martin at laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff