[OLPC/Paraguay] Debugging NetworkManager-0.7.2.995 + Power Management

Dan Williams dcbw at redhat.com
Tue Mar 30 13:26:40 EDT 2010


On Tue, 2010-03-30 at 14:07 -0300, Bernie Innocenti wrote:
> On Tue, 2010-03-30 at 09:16 -0700, Dan Williams wrote:
> > Yeah, I haven't been able to reproduce that, but I only have an XO1.5
> > with me (which doesn't have mesh), and I think the issue is timing
> > related.  So it's good that you can reproduce it. That gives us a chance
> > to fix it.
> 
> We could definitely reproduce the issue also on the XO-1.5, using os115
> from OLPC:

But... the 1.5 doesn't have the mesh capability (unless you're using a
dongle?) and thus wouldn't hit the codepath specified, no?

Unless there's some really, really egregious bug in the original mesh
code in NM that calls into the mesh device without any mesh device
having ever been created, that is...

>  http://build.laptop.org/10.2.0/os115
> 
> Just leave the automatic power management on and let the system go to
> sleep a few times.

Sugar or GNOME?

At least if you can reproduce with a 1.5 then I have a chance of seeing
it in action too.

Dan

> 
> > So what is going on is that the mesh device and the wifi device are
> > obviously the "same" device, because they really share the same silicon.
> > So we need to make sure the mesh device knows about its companion wifi
> > device.
> > 
> > Can I get some /var/log/messages logging from NetworkManager over a
> > suspend/resume cycle?  I'd like to see if the kernel removes either the
> > wifi or the mesh device and then adds it back after resume, or whether
> > the device sticks around throughout the entire cycle.
> 
> From what I've seen in gdb after the segfault, it looked very much like
> the is_companion() callback had been invoked on an object that had
> already been freed. The GObject class was zeroed out.
> 
> I'll wait for Martin (tch) to come back from lunch to send you the full
> log.
> 
> 
> > This segfault indicates one of two things:
> > 
> > 1) a reference counting issue; there's a missing g_object_ref()
> > somewhere, which means that the Mesh object is getting unreffed one too
> > many times, leading to its destruction
> >
> > 2) the kernel is removing the underlying device and there's a missing
> > idle, timeout, or signal clear command.  The mesh device listens for a
> > number of signals of other objects, but when the mesh object gets
> > destroyed, we need to remember to stop listening for those signals or
> > we'll end up in the signal handler after the object is destroyed
> > ("use-after-free").
> 
> We looked for paths where the signal is being attached and removed, and
> it *looked* like removal was being done correctly on disposal. Tch added
> some debug output to track the life-cycle of objects.
> 
> 
> > Can I also get a backtrace of the crash?  Bernie's backtrace didn't have
> > debugging info, which you clearly have.
> 
> Tch will follow up with the complete backtrace; he has a debugging
> environment where NM was built from sources.
> 
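
To make the second failure mode quoted above concrete (a signal handler
that still fires after the listening object has been destroyed), here is
a minimal model of the pattern. It is only a sketch: it is written in
Python rather than the C/GObject that NetworkManager uses, and the
names Emitter, Listener and stale_calls_after_dispose are hypothetical
stand-ins, not NM code. Python cannot actually touch freed memory, so
the "disposed" flag marks the point where the C code would crash.

```python
class Emitter:
    """Stand-in for an object that emits signals (e.g. the companion device)."""

    def __init__(self):
        self._handlers = {}
        self._next_id = 1

    def connect(self, callback):
        # Register a callback and return an id that can be used to remove it.
        handler_id = self._next_id
        self._next_id += 1
        self._handlers[handler_id] = callback
        return handler_id

    def disconnect(self, handler_id):
        self._handlers.pop(handler_id, None)

    def emit(self):
        # Copy the values so handlers may disconnect while we iterate.
        for callback in list(self._handlers.values()):
            callback()


class Listener:
    """Stand-in for the object that listens for signals (e.g. the mesh object)."""

    def __init__(self, emitter, disconnect_on_dispose):
        self.disposed = False
        self.stale_calls = 0
        self._emitter = emitter
        self._disconnect_on_dispose = disconnect_on_dispose
        self._handler_id = emitter.connect(self._on_signal)

    def _on_signal(self):
        if self.disposed:
            # In C this would be a read of freed memory (use-after-free).
            self.stale_calls += 1

    def dispose(self):
        # The fix: stop listening before the object goes away.
        if self._disconnect_on_dispose:
            self._emitter.disconnect(self._handler_id)
        self.disposed = True


def stale_calls_after_dispose(disconnect_on_dispose):
    """Count handler invocations that arrive after the listener is disposed."""
    emitter = Emitter()
    listener = Listener(emitter, disconnect_on_dispose)
    listener.dispose()
    emitter.emit()
    return listener.stale_calls
```

With disconnect_on_dispose=False the handler is still reached after
disposal; with True it is not. In GObject code the analogous step is to
disconnect the handler (for example with g_signal_handler_disconnect())
in the object's dispose path.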