Reason for the "one dot" hang found!

Thu Jun 10 17:32:52 EDT 2010

On Thu, 10-06-2010 at 16:53 -0300, Daniel Drake wrote:

> >   1 tty1     Ss+    0:02 /sbin/init
> >  945 ?        Ss     0:00 /bin/sh -e -c ?runlevel --set S >/dev/null || true???/
> >  950 ?        S      0:00  \_ /bin/bash /etc/rc.d/rc.sysinit
> > 1597 ?        D      0:00      \_ modprobe scsi_wait_scan
> 
> I strongly doubt this is the issue. This is a very simple module.
> 
> Note your other blocked process:
> 
> > 1035 ?        D<     0:00 /sbin/modprobe -b pci:v000011ABd00004102sv000011ABsd00
> 
> This one also has a lower process ID, suggesting that it was run first.
> 
> I suspect there is a crash/hang within this module, and at this point,
> attempting to load any other module (scsi_wait_scan or otherwise) will
> hang, due to contention on a lock, corruption, a dead kernel thread,
> or something like that.

Ok, makes sense. If one module hangs during init, any subsequent
invocation of modprobe would also hang.
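
For anyone else chasing this, a minimal sketch of how the blocked tasks
could be picked out of a process list, lowest PID first, so the earliest
stuck modprobe shows up at the top. The helper name blocked_tasks and the
"pid stat command" column layout are assumptions made for illustration,
not something taken from the original report.

```shell
# Hypothetical helper: read "pid stat command" lines on stdin and keep
# only tasks in uninterruptible sleep (STAT starting with D, e.g. D or D<),
# sorted numerically by PID.
blocked_tasks() {
    awk '$2 ~ /^D/' | sort -n -k1,1
}

# Possible use on a live system (the ps header line is dropped by the
# filter, because its second field is "STAT"):
#   ps -eo pid,stat,args | blocked_tasks
```

A lower PID only suggests that a process started earlier, so this is a
hint about ordering, not proof of which module hung first.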

> My suggested next steps in diagnosis:
>  1. Identify which device is pci:v000011ABd00004102
> Anyone can do this on any XO-1 with: lspci -vd 11ab:4102
> I'm pretty sure it's a part of the CAFE chip but I don't have an XO to check.

It's the camera controller. Hence, the other module being loaded must be
cafe_ccic.
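
As an aside, the vendor:device pair for lspci can be derived mechanically
from the modalias prefix seen in the process list. A rough sketch, assuming
the usual "pci:v<8 hex digits>d<8 hex digits>..." layout with the upper four
digits of each ID being zero; the function name modalias_to_pci_id is made
up for this example.

```shell
# Hypothetical helper: turn a PCI modalias prefix into the lowercase
# vendor:device pair that "lspci -d" accepts. Prints nothing if the
# argument does not match the assumed layout.
modalias_to_pci_id() {
    printf '%s\n' "$1" |
        sed -n 's/^pci:v0000\([0-9A-Fa-f]\{4\}\)d0000\([0-9A-Fa-f]\{4\}\).*/\1:\2/p' |
        tr 'A-F' 'a-f'
}

# Possible use with the (truncated) modalias from the process list:
#   lspci -vd "$(modalias_to_pci_id pci:v000011ABd00004102sv000011ABsd00)"
```

Only the vendor and device fields are used, so the truncated tail of the
modalias does not matter here.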

Looking at the initialization of cafe_ccic, there seems to be a
complicated dance of mutexes and spin locks, plus a kernel thread and a
bunch of sleeps. All the ingredients for a good deadlock are present :-)

Jonathan, can you make your best guess?

>  2. Look at dmesg at point of crash
> Considering that you got a process tree I guess you can also run some
> other commands at point of hang?
> Run "dmesg" and capture output.

I did, but there was nothing interesting in dmesg, which is what I would
expect from a pure locking bug. Moreover, CONFIG_DEBUG_MUTEXES is turned
off.

Perhaps interestingly, on regular boots, I can see some psmouse
initialization messages intermixed with the cafe_ccic ones.

>  3. Capture kernel task dump at point of crash
> echo t > /proc/sysrq-trigger
> The task dump will appear in kernel logs (dmesg).

Ok, I'll do it as soon as I see it again.

BTW: this bug seems to be easier to trigger by forcing a shutdown while
some data is being written to disk.

-- 
   // Bernie Innocenti - http://codewiz.org/
 \X/  Sugar Labs       - http://sugarlabs.org/